THE  UNIVERSITY  OF  ALBERTA 


AN  ON-LINE  INFORMATION  RETRIEVAL  SYSTEM 
WITH  AN  APPLICATION  TO  WESTERN  CANADIAN  HISTORY 

by 

Roger  F.  Halpin 


A  THESIS 

SUBMITTED  TO  THE  FACULTY  OF  GRADUATE  STUDIES 
IN  PARTIAL  FULFILMENT  OF  THE  REQUIREMENTS  FOR  THE  DEGREE 

OF  MASTER  OF  SCIENCE 


DEPARTMENT  OF  COMPUTING  SCIENCE 
EDMONTON,  ALBERTA 


SEPTEMBER,  1967 


UNIVERSITY  OF  ALBERTA 


FACULTY  OF  GRADUATE  STUDIES 


The  undersigned  certify  that  they  have  read,  and 
recommend  to  the  Faculty  of  Graduate  Studies  for  acceptance 
a  thesis  entitled  AN  ON-LINE  INFORMATION  RETRIEVAL  SYSTEM 
WITH  AN  APPLICATION  TO  WESTERN  CANADIAN  HISTORY  submitted 
by  Roger  F.  Halpin  in  partial  fulfilment  of  the  requirement 
for the degree of Master of Science.



ABSTRACT 


This  thesis  reviews  problems  in  the  information 
storage  and  retrieval  cycle  and  describes  the  development 
of  an  experimental  on-line  storage  and  retrieval  system 
(SARA)  with  a  present  data-base  of  documents  in  Western 
Canadian  history.  The  review  covers  methods  for 
converting  information  in  documents  to  machine  readable 
form,  and  the  automatic  analysis  of  this  information  to 
produce indexed documents, abstracts, and classifications;
it describes briefly seven operational information
storage and retrieval systems. SARA, which utilizes
time-shared computing facilities and a new programming
language called APL, is described in detail and evaluated.



ACKNOWLEDGEMENTS 


I  express  my  appreciation  to  Professor  K.W.  Smillie 
for  the  guidance  given  me  in  the  preparation  of  this  thesis, 
to Professor Doreen M. Heaps for her interest and assistance
in this topic, and to Professor D.B. Scott, Head of the
Department of Computing Science, for providing computing
facilities and financial assistance while this research was
being done. I also wish to thank the Department of History,
and  in  particular  Mr.  J.  Nicks,  for  the  willing  help  given 
in  the  preparation  of  test  material. 



TABLE  OF  CONTENTS 


CHAPTER I - GENERAL THESIS AND PROBLEMS IN INFORMATION SYSTEMS

CHAPTER II - TEXT CONVERSION TO MACHINE READABLE FORM
    2.0 Introduction
    2.1 n-Tuple Methods
    2.2 Random Nets
    2.3 Template Matching
    2.4 Analytic Methods
    2.5 Discussion

CHAPTER III - AUTOMATIC ANALYSIS OF TEXT MATERIAL
    3.0 Introduction
    3.1 Automatic Indexing
    3.2 Automatic Abstracting
    3.3 Automatic Classification
    3.4 Other Areas of Computer Analysis of Text
    3.5 Discussion

CHAPTER IV - OPERATIONAL INFORMATION STORAGE AND RETRIEVAL SYSTEMS
    4.0 Introduction
    4.1 Batch-Processing Systems
        4.1.1 The PICUPS System
        4.1.2 The MEDLARS System
        4.1.3 The HAYSTAQ System
    4.2 Real-Time, Time-Shared Systems
        4.2.1 The CONVERSE System
        4.2.2 The Technical Information Project
        4.2.3 The SMART System
        4.2.4 The BOLD System

CHAPTER V - THE SARA SYSTEM
    5.0 Introduction
    5.1 A General Description of the SARA System
        5.1.1 The Hardware Environment and the Programming Language
        5.1.2 The Control Subsystem
        5.1.3 The Storage Subsystem
        5.1.4 The Retrieval Subsystem
    5.2 Details of the Operation of the System

CHAPTER VI - COMPARISON AND EVALUATION OF THE SARA SYSTEM
    6.0 Introduction
    6.1 APL As a General Programming Language
    6.2 SARA and Other On-Line Systems
    6.3 Strengths and Weaknesses of the SARA System

BIBLIOGRAPHY

APPENDIX A - SELECTION AND INDEXING OF DOCUMENTS

APPENDIX B - BLOCK DIAGRAMS AND LISTINGS OF ROUTINES IN SARA

APPENDIX C - EXAMPLES OF USE OF SARA



LIST  OF  FIGURES 


Figure 1.0.1
Figure 2.0.1
Figure 2.2.1
Figure 3.1.1
Figure 4.2.1
Figure 4.2.2
Figure 5.1.1
Figure 5.1.2
Figure 5.1.3
Figure 5.1.4
Figure 5.1.5
Figure 5.1.6
Figure 5.1.7
Figure A.1
Figure A.2
Figure A.3


CHAPTER  I 


GENERAL  THESIS  AND 
PROBLEMS  IN  INFORMATION  SYSTEMS 

The  phrase  "information  storage  and  retrieval"  can 
take  many  meanings.  For  example,  the  local  library  and  the 
office  filing  system  can  be  considered  information  storage 
and  retrieval  systems.  In  the  present  thesis,  the  phrase 
will  refer  to  systems  which  prepare  and  store  documents  and 
subsequently  retrieve  them  (or  their  addresses)  in  response 
to  requests.  The  systems  may  employ  humans  and  machines 
such  as  computers. 

The term "information" has been used in a multitude
of ways. In this thesis, the term will refer to the
characters which, when ordered, form words. These
characters may be printed or written on paper. The term will
not imply that any interpretation of the string of characters
is made, or that meaning is derived from the string of
characters.

Information storage includes the preparation of
information which exists in document form and the storage
of this information in some systematic manner suitable for
subsequent retrieval. There are various definitions of
information retrieval. Vickery (1965) states that "Retrieval
is the selection of documentary information from a store, in
response to search questions." Bourne (1963) differentiates
between reference retrieval, document retrieval,
fact retrieval and information retrieval. He defines
document retrieval as a process which yields a complete
copy of a document in response to a general search question,
while information retrieval is a process which yields
information in response to a request for information. Such a
request might be: "What is the difference between a point-contact
transistor and a junction transistor?". Kent (1962) gives
a more general definition of machine literature searching,
or information retrieval: "...the use of mechanized or
other non-conventional tools in connection with any one
or more unit operations" where a unit operation is "...a
series of functions, or steps...".

Information  storage  and  retrieval  systems  have  been 
in  operation  since  knowledge  has  been  recorded.  An  example 
of  such  a  system  is  a  library.  Information  in  the  form  of 
printed  characters  comprising  documents  is  stored  on  shelves 
and indexed and catalogued for retrieval. In order to
retrieve the information in the documents, the index cards
are consulted and the proper documents are retrieved.

The volume of published material has been increasing
at an exponential rate (see Bourne (1963)). Traditional
techniques used to process the large volume of information
cannot cope with the information explosion; new techniques
must be developed. Hence, machines, particularly computers,
are being used to ease the effect of this explosion. This
thesis concentrates on several specific applications of
computers to information storage and retrieval, although
it acknowledges the importance of other non-computer phases
of storage and retrieval. Particular aspects of computer
applications are reviewed, and a practical example of an
information retrieval system, named SARA, for Storage And
Retrieval Alberta, and programmed for a time-shared computer
in a new programming language called APL, is given.
Many aspects of information retrieval are not covered.

Information storage and retrieval systems vary
greatly in the degree of mechanization. Semi-mechanized
systems utilize machines only in part of the storage and
retrieval cycle, usually in the retrieval of information.
The preparation of documents for retrieval is done manually
in these systems. Fully mechanized systems carry out all
processes of the information storage and retrieval cycle
automatically. It has not been shown conclusively that
automatic indexing is satisfactory in all disciplines,
particularly in fields such as history. Figure 1.0.1
pictures in block diagram form a fully mechanized general
information storage and retrieval system. Documents
containing information must first be converted to a form
compatible with the machines. The source of the information
may consist of the words in a book, an article or paper,
or even in personal communications. In most instances,
the source of the information will be in some printed
form. Other examples of documents are voice or fingerprint
records. The document is converted to machine
readable form, usually paper or magnetic tape, or punched
cards. After the information has been changed into a form
which can be manipulated within the computer, operations of
automatic indexing, abstracting, or classification of the
document can take place. Automatic indexing consists of
mechanically assigning valid index terms to a document
sufficiently accurately that the primary information of the
document is retained. Automatic abstracting is the automatic
assignment of a few descriptive natural language sentences
which best convey the theme of the document. Automatic
classification is the automatic grouping of documents into
classes in order to reduce the number of documents to be
searched. The document can then be coded into a more
economical form, such as words being replaced by numbers,
for future searching; the natural language form can be
retained in a separate file for display purposes.

Figure 1.0.1
Totally Machine Processed Storage and Retrieval
[Block diagram. Storage: convert to machine readable form;
automatic indexing, abstracting, classification; code
document; store coded document in main file; store natural
language document in display file. Retrieval: edit, code,
expand natural language request; match request against main
file; addresses of matched documents; retrieve and display
natural language documents.]
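The coding step mentioned above, in which words are replaced by numbers, can be illustrated with a minimal sketch. The function names `build_code_table` and `encode`, and the choice of numbering words in order of first appearance, are assumptions made for illustration only; they are not taken from any system described in this thesis.

```python
def build_code_table(documents):
    """Assign each distinct word an integer code, in order of first appearance."""
    code_table = {}
    for text in documents:
        for word in text.lower().split():
            # setdefault leaves an existing code untouched
            code_table.setdefault(word, len(code_table) + 1)
    return code_table


def encode(text, code_table):
    """Replace every word of a document by its integer code."""
    return [code_table[word] for word in text.lower().split()]


documents = ["Fur trade posts", "fur trade routes"]
code_table = build_code_table(documents)
coded = [encode(text, code_table) for text in documents]
```

The coded form would be kept for searching, while the original text would remain in a separate display file, as the paragraph above describes.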

The second phase, retrieval of stored documents, is
the reverse of the storage phase. In most systems a user
may request documents by specifying index terms in natural
language, connected by Boolean operators. The request is
edited, coded and expanded according to user options, and
is matched against some subset of the encoded file of
documents. Documents (or addresses of documents) satisfying
the match can then be displayed to the user.
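As an illustration of matching a Boolean request against indexed documents, the following sketch may be helpful. It assumes that each document address carries a set of index terms and that a request is written as a nested tuple such as `("AND", term, ("NOT", term))`; this representation, the function names, and the sample terms are hypothetical and do not describe SARA or any other system reviewed here.

```python
def build_index(documents):
    """Map each index term to the set of document addresses carrying it."""
    index = {}
    for address, terms in documents.items():
        for term in terms:
            index.setdefault(term, set()).add(address)
    return index


def match(request, index, all_addresses):
    """Return the addresses satisfying a nested-tuple Boolean request."""
    if isinstance(request, str):          # a single index term
        return set(index.get(request, ()))
    operator, *operands = request
    results = [match(operand, index, all_addresses) for operand in operands]
    if operator == "AND":
        return set.intersection(*results)
    if operator == "OR":
        return set.union(*results)
    if operator == "NOT":
        return set(all_addresses) - results[0]
    raise ValueError(f"unknown operator: {operator!r}")


documents = {
    1: {"fur trade", "edmonton"},
    2: {"fur trade", "railways"},
    3: {"railways"},
}
index = build_index(documents)
request = ("AND", "fur trade", ("NOT", "railways"))
matched = match(request, index, documents.keys())
```

Only the addresses are returned; displaying the natural language form of each matched document would be a separate step.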

The SARA information storage and retrieval system is
made up of two parts. The first part consists of the manual
selection of documents and index terms and the assignment of
relationships between these terms. The second part consists of
a mechanized system which deals with the storage of these
documents and their subsequent retrieval. To understand
the system, it is necessary to understand some of the
difficulties encountered. These are described in Chapters
II and III. Chapter II reviews the literature on the
conversion of documents to machine readable form. Chapter
III describes advances made in automatic indexing,
abstracting and classification of documents. Chapter IV reviews
seven operational information storage and retrieval systems:
three are designed around a batch-processing monitor and
four around a time-shared monitor. Chapters V and VI are
the most significant. Chapter V deals primarily with the
overall plan and the details of the mechanized SARA system.
Chapter VI critically analyses the SARA system, suggests
possible improvements to it and evaluates the programming
language used for the problem. Appendix A describes the
manual portion of the SARA system. Appendix B contains block
diagrams and listings of routines used in the mechanized
portion of SARA, while Appendix C contains examples of a
dialogue between man and machine.


CHAPTER  II 


TEXT  CONVERSION  TO  MACHINE  READABLE  FORM 

2.0 Introduction

Information can be stored in many ways. The tribal
Indians of North America left telltale signs, such as a
small cairn of rocks, to inform following comrades about
their direction and time of departure from the camp. In
medieval Europe, handwritten manuscripts became the
principal method of storing information. Presently, the vast
majority of information is stored as characters printed
on a page. However, information stored in this form
cannot be utilized directly by a computer. It must be
converted to a form which the computer can manipulate
easily, e.g., characters on punched cards or magnetic
tape. This chapter introduces two methods of conversion:
keypunching and direct source production. The subject of
the remainder of the chapter is a review of the research
in the field of pattern recognition. Much work has been
done on this problem, and much more will be required
before a machine will convert a printed page economically
into machine readable form. Only the conversion of
present forms of storing information into forms compatible
with a computer is dealt with in this chapter, and no attempt
is made to resolve the problem of the meaning of words.


The patterns to be identified in all programs
discussed are represented by a square or rectangular matrix
of logical elements (Figure 2.0.1). Wherever the pattern
coincides with the matrix, the corresponding logical
element is activated; otherwise, it remains inactivated.
This matrix is examined for characteristics which uniquely
identify the input with a name, such as A, B, etc. To the
human eye, Figure 2.0.1 represents an A. The problem is
to program the computer to assign the name A uniquely to
this, and to any other figure approximating it. Pattern
recognition methods are reviewed under four headings:
n-tuple, template matching, random nets and analytic
methods.

2.1 n-Tuple Methods

The n-tuple pattern recognition method consists
of grouping the input matrix elements into clusters of
size n, and then recording the state of each group for
a series of pattern names. Thus, if n equals one, each
group would have two states, activated or inactivated.
In general, the number of possible states for a group of n
elements is 2^n. The case in which n equals one was
investigated by Uttley and is described in
Uhr (1963). Cases of n equal to two have
been investigated extensively, and cases of n greater
than two somewhat less extensively (Bledsoe and Browning
(1959)).

Figure 2.0.1
Matrix Representation of the Character "A"

The n-tuple method examines the relationship
between input, the input's name and those combinations
of sets of n cells in the input matrix which are
activated. These sets may be chosen either randomly
or in a predetermined manner. The training proceeds as
follows: An input and its name are presented to the
machine. Each set of n cells is examined to determine
its state. All possible states of all combinations of
sets are recorded under the name given for the pattern.

Consider, for example, a situation with n equal
to two, as described in Bledsoe and Browning (1959), with
an input matrix of ten by fifteen binary photocells. We
have, for each of 75 sets of two cells, say cells i and
j, four possible combinations of off-on: cell i on,
cell j on; cell i on, cell j off; cell i off,
cell j on; cell i off, cell j off. Thus, for each
pattern name, we have 300 recording positions. When a
pattern and its name are presented, 75 recordings are
made, i.e., a name for the state of each of the two-cell
sets.

After a set of patterns and their names has been
presented a number of times, an unknown pattern can be
identified. This identification proceeds by the presentation
of the input, examination of all 75 sets of two cells,
and summation, for each possible name, over the 75
corresponding off-on states of the input pattern. The name
corresponding to the highest score is chosen.
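The training and identification steps above can be sketched as follows for n equal to two. This is a minimal illustration under stated assumptions, not the program of Bledsoe and Browning: the class name `NTupleClassifier`, the random disjoint pairing of cells, and the use of a single 0/1 mark per recording position are all choices made here for the example.

```python
import random


def make_pairs(num_cells, rng):
    """Randomly partition the cells into disjoint sets of two (num_cells even)."""
    cells = list(range(num_cells))
    rng.shuffle(cells)
    return [(cells[k], cells[k + 1]) for k in range(0, num_cells, 2)]


def pair_states(pattern, pairs):
    """State 0..3 of each pair: off-off, off-on, on-off, on-on."""
    return [2 * pattern[i] + pattern[j] for i, j in pairs]


class NTupleClassifier:
    def __init__(self, num_cells, seed=0):
        self.pairs = make_pairs(num_cells, random.Random(seed))
        self.memory = {}  # name -> one 4-position record per pair

    def train(self, pattern, name):
        record = self.memory.setdefault(name, [[0] * 4 for _ in self.pairs])
        for k, state in enumerate(pair_states(pattern, self.pairs)):
            record[k][state] = 1  # mark that this state occurred under this name

    def identify(self, pattern):
        observed = pair_states(pattern, self.pairs)
        scores = {
            name: sum(record[k][state] for k, state in enumerate(observed))
            for name, record in self.memory.items()
        }
        return max(scores, key=scores.get)


# A ten by fifteen matrix flattened to 150 cells gives 75 pairs,
# and 75 pairs with four states each give 300 recording positions per name.
classifier = NTupleClassifier(num_cells=150)
pattern_a = [1] * 75 + [0] * 75
pattern_b = [0] * 75 + [1] * 75
classifier.train(pattern_a, "A")
classifier.train(pattern_b, "B")
```

With a pattern flattened row by row into a list of 0s and 1s, `classifier.identify(pattern)` returns the name whose recorded states agree with the largest number of the 75 pairs.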

The  results  of  the  tests  by  Bledsoe  and  Browning 
(1959)  were  quite  encouraging.  The  experiment  was 
continued  with  other  modifications,  such  as  normalization 
of  the  pattern,  examination  of  the  distribution  of  possible 
patterns, and consideration of word context. Using handwritten
patterns with a large number of names (36), up to
9^  percent  correct  recognition  was  attained. 

2.2 Random Nets

The random net approach is discussed by Brain,
Porsen, Nilsson and Rosen (1962) as well as by Rosenblatt
(1960). The basic component of the random net approach
consists of a unit called the Threshold Logic Unit (TLU).
A set of input lines fires into the unit, and, if the
weighted sum of these inputs exceeds a specified threshold,
the output fires.
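The behaviour of a single Threshold Logic Unit, as described above, can be written in a few lines. The function name `tlu` and the sample weights and threshold are illustrative assumptions only.

```python
def tlu(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


weights = [0.5, 0.5, 0.5]
fires = tlu([1, 0, 1], weights, threshold=0.75)    # weighted sum is 1.0
silent = tlu([1, 0, 0], weights, threshold=0.75)   # weighted sum is 0.5
```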

One version of the random net approach (Brain, et
al. (1962)) is that of the Perceptron (Figure 2.2.1).
Three sets of Threshold Logic Units, referred to as S, A,
and R units, are used. The first set of 35 sensing units
(S units) is constructed as an input matrix. Each of ten
associative units (A units) is randomly connected to
nine S units. All ten A units are then connected to a
single response unit (R unit). The connections between
S and A units can be denoted by a logical matrix, C, each
element of which denotes a connection between an A and an S
unit. Also, a weight vector, W, denotes the weights
associated with the connections between the A units and the
R unit. Note that the input is to be classified under one of
two names, i.e., the output of the R unit is binary.

Figure 2.2.1
Schema of a Perceptron
[Diagram label: 35 sensing (S) units]

The training of this Perceptron consists of the
presentation of a pattern to the S units together with
its correct name. If the R unit gives the correct response,
another pattern is presented. If an incorrect response is
given, the weight vector, W, is altered by an amount that
would be sufficient to give the correct response if the
same pattern were immediately presented again. After the
presentation of a determinable number of patterns the
Perceptron will converge to the correct response if one
of a set of error-correction rules is used. The set of
error-correction rules, and a proof of their convergence,
is given by Nilsson (1965). Brain, et al. (1962) cite
another example of a random net approach called Madaline.
This random net approach, developed by Widrow, consists
of the same arrangement as found in the Perceptron.
However, instead of the W vector being altered in the training
procedure, the C matrix denoting the connections between
A and S units is altered. The W vector remains constant.
No concrete results of tests are given in this paper; it
is felt that better results can be obtained by less random
techniques.
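The Perceptron training procedure described above can be sketched as follows. Several details are assumptions made for illustration: the A units use a simple count threshold, a bias term stands in for the threshold of the R unit, and the weights are adjusted by a fixed increment on each error (a common error-correction rule) rather than by the amount just sufficient to correct the response. All function names are hypothetical.

```python
import random


def make_connections(num_s, num_a, fan_in, rng):
    """C matrix in list form: for each A unit, the S units wired to it."""
    return [rng.sample(range(num_s), fan_in) for _ in range(num_a)]


def associative_outputs(pattern, connections, a_threshold):
    """Each A unit fires when more than a_threshold of its S units are active."""
    return [
        1 if sum(pattern[s] for s in wired) > a_threshold else 0
        for wired in connections
    ]


def response(a_out, weights, bias):
    """Binary output of the R unit."""
    total = sum(w * a for w, a in zip(weights, a_out)) + bias
    return 1 if total > 0 else 0


def train(samples, connections, a_threshold, max_passes=100):
    """Alter W (and the bias) only when the R unit responds incorrectly."""
    weights = [0.0] * len(connections)
    bias = 0.0
    for _ in range(max_passes):
        mistakes = 0
        for pattern, target in samples:
            a_out = associative_outputs(pattern, connections, a_threshold)
            error = target - response(a_out, weights, bias)
            if error != 0:
                mistakes += 1
                weights = [w + error * a for w, a in zip(weights, a_out)]
                bias += error
        if mistakes == 0:  # every pattern answered correctly
            break
    return weights, bias


rng = random.Random(0)
connections = make_connections(num_s=35, num_a=10, fan_in=9, rng=rng)
samples = [([1] * 35, 1), ([0] * 35, 0)]
weights, bias = train(samples, connections, a_threshold=4)
```

The connections are fixed once they have been drawn; only the weight vector changes during training, which corresponds to the Perceptron rather than to Madaline.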

Roberts (1960) did some experimental work with random
nets, which duplicated and extended the work done on the
Perceptron. By using a modified error-correction rule,
and altering the W vector, Roberts achieved up to 94 percent
correct classification on a set of 44 characters. A further
constraint was imposed by Roberts in order to achieve this
high degree of correct response. Instead of completely
random connections, the C matrix was preset by the author
such that all S units were uniformly distributed over the
A units. This choice of the C matrix, along with the
error-correction rule used, produced the effect of recognizing
spatial nearness in the device, an effect which the
Perceptron and Madaline failed to achieve.

The random net approach emphasizes parallel, as
opposed to sequential, processing of information. In
parallel processing all information is gathered at each
stage; a decision is not made until all possibilities
have been calculated. The next stage of the classification
is then initiated. In sequential processing a
calculation is made and a decision is arrived at after
each calculation. The choice of a superior method depends
on the cost of making a decision versus the cost of making
a calculation.

Selfridge (1959) stresses parallel processing. A
set of cognitive and computational "demons" is connected
in layers, each layer being equivalent to the A units of
the Perceptron. The initial layer, referred to as data
demons, corresponds to the S units, and the decision demon
corresponds to the R unit. All demons of each layer are
connected to all demons of the next layer. There is no
randomness in these connections. Various training
procedures are described, but no results are given.

2.3 Template Matching

Template matching is the easiest and most widely
used commercial method for pattern recognition to date.
However, it is quite limited in its scope. The method as
outlined by Minsky (1961) consists of two phases, the
first being the normalization of the input pattern. This
is achieved by changing the relative size of the pattern
to match the size of the internally stored pattern, and by
rotating the pattern about some point, usually its
center of gravity, in order to orient it to the internally
stored replica. The data may also be smoothed. The second
phase consists of matching the normalized pattern against
a previously stored set of all the possible patterns.
Similarities are calculated for all patterns, and the
name with the highest similarity score is chosen.
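The second phase of template matching can be sketched as follows, assuming the input has already been normalized to the size and orientation of the stored templates. The similarity measure used here (the fraction of matrix cells on which pattern and template agree) is an assumption chosen for simplicity; the function names and the 3 by 3 templates are likewise illustrative.

```python
def similarity(pattern, template):
    """Fraction of cells on which two equal-sized binary matrices agree."""
    cells = [
        (p, t)
        for pattern_row, template_row in zip(pattern, template)
        for p, t in zip(pattern_row, template_row)
    ]
    return sum(p == t for p, t in cells) / len(cells)


def identify(pattern, templates):
    """Return the name of the stored template most similar to the pattern."""
    return max(templates, key=lambda name: similarity(pattern, templates[name]))


templates = {
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
}
# A "T" with one cell of its stem missing.
damaged_t = [[1, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
name = identify(damaged_t, templates)
```

Because every stored template must be compared with the input, the work grows with the number of possible patterns.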

There are many drawbacks to such a system. A
prototype of all possible patterns must be available to the
machine prior to identification. A set of similarity
tests must be programmed into the model. Abstract classes
of patterns, such as all patterns with three intersections
of straight lines, cannot always be handled. Slight
variations of the input patterns are critical to correct
identification.
However, it will be noted that, to date, this is
the principal method employed by commercial machines.
Subject to the above limitations, the percentage of correct
identifications is extremely high. If the percentage of
correct identifications is the only criterion of success,
then this method rates high among all the methods reviewed
here.

2.4 Analytic Methods

The analytic methods developed to date most closely
parallel the observed functioning of human processing of
information. Gyr, Brown, Willey and Zivan (1966) suggest
a recognition algorithm which recognizes only straight
lines. The algorithm is original in that it attempts to
simulate the observed behavior of humans as closely as
possible without regard to the efficiency of the algorithm.
The input, a 144 by 144 logical matrix, is scanned by a
smaller matrix called the retina. This is a 36 by 36
operator matrix which is divided into a periphery (the
outer section of the retina) and the fovea (the main
detector). As the retina moves across the pattern, it
is directed along the straight line by the fovea. The
scan is divided into two parts. If a "quick look"
criterion is satisfied, then the scan continues; if not,
a "close look" is initiated, and a decision is made on
whether it is still on a straight line. Small amounts
of noise can be tolerated.

Other more elaborate systems have been programmed
to recognize more than sections of a pattern. Selfridge
and Neisser (1960) developed a program which, after
clean-up and normalization of a pattern, inspects features of
the pattern, and ranks each pattern by its similarity
with respect to these features. During training, 28
features such as "the maximum intersection with horizontal
line", "concavity facing south", and "length of the south
edge" are inspected for each of the ten possible patterns,
and probabilities of occurrence are calculated for each
pattern for each feature. Upon presentation of a pattern
to be identified, all such features are inspected, and the
probabilities for all patterns are summed up. The name
corresponding to the largest sum is assigned to the pattern.
No results are given. In the programs discussed above, all
tests performed are programmed into the model. In the
experiment discussed below not only are the characterizing
operators evaluated, but they are generated by the program
itself.
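The feature-probability scheme attributed above to Selfridge and Neisser can be sketched as follows. This is an illustration under assumptions, not their program: probabilities are estimated as relative frequencies over the training samples, and the two features shown (the largest number of active cells in any row, and the number of active cells in the bottom row) are simplified stand-ins for features such as "the maximum intersection with horizontal line" and "length of the south edge".

```python
from collections import defaultdict


def train(samples, features):
    """Estimate, per name and per feature, the probability of each feature value."""
    counts = defaultdict(lambda: [defaultdict(int) for _ in features])
    totals = defaultdict(int)
    for pattern, name in samples:
        totals[name] += 1
        for k, feature in enumerate(features):
            counts[name][k][feature(pattern)] += 1
    return {
        name: [
            {value: n / totals[name] for value, n in per_feature.items()}
            for per_feature in counts[name]
        ]
        for name in counts
    }


def identify(pattern, table, features):
    """Sum the probabilities of the observed feature values for every name."""
    observed = [feature(pattern) for feature in features]
    scores = {
        name: sum(per_feature[k].get(value, 0.0) for k, value in enumerate(observed))
        for name, per_feature in table.items()
    }
    return max(scores, key=scores.get)


features = [
    lambda p: max(sum(row) for row in p),  # widest horizontal intersection
    lambda p: sum(p[-1]),                  # length of the south edge
]
letter_t = [[1, 1, 1], [0, 1, 0], [0, 1, 0]]
letter_l = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
table = train([(letter_t, "T"), (letter_l, "L")], features)
```

Here the features themselves are fixed in advance by the programmer, which is the limitation the following paragraph addresses.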

Uhr  and  Vossler  (1963)  developed  and  tested  a  program 
which would generate and evaluate operators, and then
discard the useless ones. The input consists of a 20 by 20
binary  matrix  which  is  scanned  by  a  five  by  five  operator 
matrix.  This  operator  matrix  is  generated  either  randomly 
or  deterministically,  and  characteristic  strings  of  the 
patterns  are  generated.  These  strings  serve  to  retain, 
as  well  as  to  generalize,  the  learned  patterns.  Records 
of  success  for  the  various  operators  are  kept,  and  those 
of  little  use  are  discarded.  Amplifiers,  which  are  used 
in  general  as  well  as  in  local  discrimination  functions, 
are  adjusted,  and  serve  to  discriminate  between  the 
patterns.  The  program  was  tested  on  hand  printed  and 
written  characters  as  well  as  voice  patterns.  There  was 
a  high  degree  of  success  after  ten  training  samples.  A 
revision  of  this  work  (Praether  and  Uhr  (1964))  appears  to 
be  less  sensitive  to  noise  and  to  the  thickness  of  the 
pattern,  although  the  results  of  the  tests  are  obscure. 
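
The scanning step described above can be sketched as follows. This is a minimal illustration only, not the program of Uhr and Vossler: the function name `operator_hits`, the use of -1 as a "don't care" cell, and the exact-match rule are all assumptions made for the example.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def operator_hits(image: Matrix, operator: Matrix) -> List[Tuple[int, int]]:
    """Return every (row, col) offset at which the operator matches the image.

    Operator cells are 1 (image cell must be set), 0 (must be clear)
    or -1 (ignored). This matching rule is assumed for illustration.
    """
    rows, cols = len(image), len(image[0])
    op_rows, op_cols = len(operator), len(operator[0])
    hits = []
    for r in range(rows - op_rows + 1):
        for c in range(cols - op_cols + 1):
            matched = all(
                operator[i][j] == -1 or operator[i][j] == image[r + i][c + j]
                for i in range(op_rows)
                for j in range(op_cols)
            )
            if matched:
                hits.append((r, c))
    return hits


# A 20 by 20 binary input containing one vertical stroke in column 10.
image = [[0] * 20 for _ in range(20)]
for row in range(3, 13):
    image[row][10] = 1

# A five by five operator that responds to a vertical bar in its middle column.
vertical_bar = [[-1, -1, 1, -1, -1] for _ in range(5)]
hits = operator_hits(image, vertical_bar)
```

The list of offsets at which an operator responds plays the role of a characteristic of the pattern; a program of the kind described would keep a record of how useful each operator proves to be and discard the poor ones.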




Grimsdale,  Sumner,  Tunis  and  Kilburn  (1959)  approached 
the  problem  in  a  different  way.  Each  pattern  was  first 
divided  into  various  components  by  a  scan,  and  then  analyzed 
as  the  components  were  reassembled.  More  information  was 
retained  about  the  topology  of  the  figure  than  by  previous 
methods,  and  the  system  was  relatively  insensitive  to 
orientation  of  the  pattern.  The  approach  involved  an 
analysis  of  the  pattern  as  a  whole  since  information  about 
the  form  of  each  part  and  its  connection  with  the  other 
parts  of  the  pattern  was  retained. 

2.5 Discussion

There  are  many  approaches  to  solving  the  problem  of 
creating  machine  readable  text.  The  most  general  method, 
that  of  pattern  recognition,  has  advanced  rapidly  since 
it  was  first  proposed.  However,  any  commercial  system 
now  available  is  not  only  expensive  but  cannot  recognize 
the  wide  range  of  type  fonts  which  would  be  required  of  it. 
The  most  promising  technique  at  this  time  appears  to  be 
capture  of  the  data  at  the  point  of  publishing.  Some 
publications,  such  as  Chemical  Abstracts,  presently  provide 
such  a  service.  Magnetic  tape  copies  of  the  abstracts 
provided  by  Chemical  Abstracts  are  available.  Cooperation 
among  publishers  and  users  in  this  direction  may  yield  the 
most benefit in solving the problem of producing machine
readable text.

The  field  of  history  is  of  particular  concern  in  the 
present investigation. Since there is already a great
deal  of  information  in  printed  form,  a  solution  to  the 
problem  of  character  recognition  would  be  of  great  value. 




CHAPTER  III 


AUTOMATIC  ANALYSIS  OF  TEXT  MATERIAL 

3.0 Introduction

One of the most important phases in the information
storage and retrieval cycle is that of preparing documents
for subsequent retrieval. Chapter II indicated one manner
of preparing documents. Another way could consist of
assigning index terms, or descriptors, to documents. An
index term describes part or all of the content of a document.
For example, the index terms "church", "economics",
"politics" and "1930" may describe a document entitled
"The Political Impact of Church Estates in 1930". Retrieval
of such documents depends on the appropriateness of the
index terms assigned to the documents. The document must
be described as concisely as possible for efficient retrieval,
yet at the same time as comprehensively as possible for
retrieval in the future. Hence, the text must be described
in terms sufficient for both present and anticipated needs.

The  digital  computer  has  long  been  recognized  as  a 
device which could be used to analyse text automatically.
With the computer's very powerful arithmetical and logical
capabilities,  a  statistical  analysis  of  machine  readable 
text  material  becomes  possible,  once  the  type  of  analysis 
has been determined. With the large memories of the machines
of today and with the economic feasibility of ever increasing
memory  sizes,  machines  can  perform  logical  operations  among 
large  numbers  of  words  of  text.  Moreover,  complex  table 
lookups,  which  can  be  useful  in  automatic  analysis  of  text 
material,  add  to  the  power  of  a  computer. 

The  measurement  of  the  effectiveness  of  such  indexing 
is  in  itself  a  large  problem.  Although  comparison  with 
human  indexing  is  the  most  obvious  method  of  measurement, 
it  is  not  entirely  satisfactory:  humans  cannot  always 
agree  on  how  a  document  should  be  indexed,  abstracted,  or 
classified.  Criteria  unrelated  to  human  measurement  should 
be  set  up  in  such  a  way  that  the  aims  of  automatic  indexing 
are satisfied, viz., such that the indexing terms used to
describe the article result in high relevance of the documents
retrieved, or in comprehensive classification.
Ultimately,  the  effectiveness  of  the  indexing  will  be 
determined  by  the  appropriateness  of  the  retrieved  documents. 

Presently,  much  manual  effort  is  being  expended  on  the 
indexing,  abstracting,  and  classification  of  documents.  As 
the volume of material to be stored increases, more trained
personnel will be required; hence, investigation into
the  automation  of  these  tasks  can  be  justified. 

3.1 Automatic Indexing

In  all  phases  of  automatic  analysis  of  text,  some  form 
of automatic indexing is used. Most of the methods used to
perform automatic abstracting and classification eventually
depend on a choice of index terms which describe the text.
If the methods are to be effective, the choice of index
terms is critical.

The  methods  to  be  described  for  choosing  informative 
index  terms  are  dependent  on  word  frequency.  Some  methods 
also use auxiliary information, such as the frequency of
occurrence  of  word  groups  and  the  function  of  the  word  in 
the  sentence. 

Pioneering  work  in  the  field  of  automatic  indexing 
was  done  by  Luhn  (1958).  Using  an  IBM  704  computer  he 
analyzed  scientific  and  technical  text  punched  on  cards, 
and  produced  a  list  of  significant  words  ranked  by  frequency 
of  occurrence  in  the  text.  Luhn  reasoned  that  the  frequency 
of  occurrence  of  a  word  root  in  the  text  was  a  measure  of 
its "information power". For example, "differ", "differentiate",
"difference" and "differently" are all of the
same  root.  In  calculating  the  information  power  of  these 
words,  all  forms  of  "differ"  would  be  considered  identical. 

A  frequency  count  of  all  the  words  resulted  in  a 
curve  similar  to  that  given  in  Figure  3.1.1.  The  words  of 
high  frequency  such  as  "the",  "a",  etc.,  constitute  noise, 
and  could  be  eliminated  by  a  table  look-up  procedure. 

Walston (1965), in reviewing the work of Luhn, suggested
that a high frequency cutoff through statistical analysis
could also be used. This is line C in Figure 3.1.1. The
remaining  words  were  then  ranked  by  frequency,  and  those 
of  highest  frequency  were  chosen  as  index  terms  (between 
lines  C  and  D  in  Figure  3.1.1).  Luhn  further  reasoned 
that  the  information  power  of  the  words  bracketed  by  lines 
C  and  D  was  represented  by  curve  E.  Thus,  the  list  of 
words  produced  would  constitute  a  representative  picture 
of  the  text  analyzed.  This  method  is  oriented  toward, 
and  easily  implemented  upon,  computers.  However,  any 
information  contained  in  the  grammar  or  syntax  of  the 
article,  or  groupings  of  words,  is  ignored. 
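
The frequency procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than Luhn's program: the common-word list is a small sample, the word root is approximated by truncation to six letters (the hypothetical helper `crude_root`), and the two cutoff values are chosen by hand.

```python
import re
from collections import Counter

# Illustrative common-word table; a real table would be far longer.
COMMON_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def crude_root(word: str, length: int = 6) -> str:
    """Approximate a word root by truncation (an assumption, not Luhn's rule)."""
    return word[:length]


def significant_words(text: str, high_cutoff: int, low_cutoff: int):
    """Rank word roots by frequency, keeping those between the two cutoffs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_root(w) for w in words if w not in COMMON_WORDS)
    kept = [(root, n) for root, n in counts.items()
            if low_cutoff <= n <= high_cutoff]
    # Highest frequency first; ties are broken alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


sample = ("The difference differs. Differently, the index of an index "
          "is an index term.")
ranked = significant_words(sample, high_cutoff=10, low_cutoff=2)
```

Here the table look-up removes the noise words before counting, while the two cutoffs bracket the band of roots retained as index terms.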

[Figure 3.1.1: A Word Frequency Diagram. The horizontal axis is labelled WORDS; the diagram marks a high-frequency cutoff, a low-frequency cutoff, the significant words, and the information power of significant words.]

Baxendale (1958) compared three methods of automatic
indexing. The primary aim was to decrease the amount of
text analyzed, and still retain as much as possible of
the information contained in the entire text. A set of
six papers from six different scientific journals was used
to test each of the three methods. In all three methods,
the technique of deletion of common words before analysis
of the text was used to decrease the noise factor. The
first method, similar to Luhn's, was an analysis of the
entire text, with subsequent ranking by frequency of the
remaining words after deletion of common words. The second
method was an analysis of the topic sentences of each paragraph.
A previous analysis indicated that 85 percent of the
topic sentences occurred as the first sentence, while seven
percent occurred as the last sentence, of each paragraph.
The second method consisted firstly of the selection of
the first and last sentence of each paragraph, secondly of
the deletion of common words, and thirdly of the ranking of
the remaining words by frequency. The third method utilized
the  fact  that  in  English  much  information  is  carried  in 
prepositional  phrases.  By  comparison  with  a  previously 
compiled  list  of  prepositions,  the  prepositional  phrases 
of  the  document  were  isolated.  Selection  of  the  following 
four  words  (unless  punctuation  or  another  preposition 
intervened)  constituted  the  selection  set.  Common  words 
were  deleted,  and  the  remaining  words  were  ranked  by 
frequency.  The  three  methods  resulted  in  a  remarkable 
similarity  among  the  ranking  of  the  words.  By  using  the 
second or third method described above, results comparable
to the first method could be attained with less
processing  and  less  machine  readable  text.  As  a  by-product 
of  the  third  method,  terms  used  together  in  the  original 
article  remain  coordinated  to  a  certain  degree,  thus 
retaining  some  of  the  syntax  of  the  original  article. 
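
The third method lends itself to a short sketch. The following is illustrative only: the preposition and common-word lists are small samples, and the function name `phrase_terms` is hypothetical. It selects up to four words after each preposition, stops at punctuation or at another preposition, deletes common words, and counts what remains.

```python
import re
from collections import Counter

# Small illustrative tables; the lists actually used are not given in the text.
PREPOSITIONS = {"of", "in", "on", "for", "with", "by", "to"}
COMMON_WORDS = {"the", "a", "an", "and", "is", "are"}


def phrase_terms(text: str, span: int = 4):
    """Count the words selected from prepositional phrases, most frequent first."""
    tokens = re.findall(r"[a-z]+|[.,;:!?]", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token not in PREPOSITIONS:
            continue
        for following in tokens[i + 1:i + 1 + span]:
            # Punctuation or another preposition ends the selection early.
            if following in PREPOSITIONS or not following.isalpha():
                break
            if following not in COMMON_WORDS:
                counts[following] += 1
    return counts.most_common()


sample = ("The analysis of the topic sentences in each paragraph. "
          "Ranking of words by frequency.")
terms = phrase_terms(sample)
```

Because the selected words are taken from the same phrase, terms that were used together in the article tend to be selected together, which is the coordination effect noted above.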

Edmundson  and  Wyllys  (1961)  used  a  different  approach 
to the problem of selecting index terms to describe a document.
They advanced the argument that the information
contained in a word is inversely proportional to the
frequency of occurrence of the word; thus, the rare or
unusual words in an article give the greatest indication
of its content. However, the word must be rare in general
usage, not rare within the article itself. Four significance
factors for each word are suggested:

(3.1)  s_1 = f - r

(3.2)  s_2 = f / r

(3.3)  s_3 = f / (f + r)

(3.4)  s_4 = log (f / r)

where s = the significance factor of a word;

      f = the relative frequency of a word within the document;

      r = the relative frequency of a word in general use.

These  significance  factors  are  analyzed  to  determine  which 
have  the  greatest  relevance  to  the  document  being  indexed. 
The authors conclude that the functions s_1 = f - r or
s_2 = f / r are the most relevant, though no experimentation
was carried out by the authors. This method does not require
that  common  words  be  deleted.  The  significance  function 
will  handle  these. 
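
As a worked illustration of equations (3.1) to (3.4), the sketch below evaluates the four factors for a word that is frequent in the document but rare in general use, and for a common word. The function name and the sample frequencies are assumptions chosen for the example.

```python
import math


def significance_factors(f: float, r: float) -> dict:
    """Evaluate the four significance factors for relative frequencies f and r."""
    if f <= 0 or r <= 0:
        raise ValueError("f and r must be positive relative frequencies")
    return {
        "s1": f - r,             # equation (3.1)
        "s2": f / r,             # equation (3.2)
        "s3": f / (f + r),       # equation (3.3)
        "s4": math.log(f / r),   # equation (3.4)
    }


# A word making up 2% of the document but only 0.1% of general usage.
rare_word = significance_factors(f=0.02, r=0.001)

# A common word that is slightly less frequent here than in general usage.
common_word = significance_factors(f=0.05, r=0.06)
```

The common word receives a negative s1 and s4 and a ratio s2 below one, which shows why no separate deletion of common words is needed with these functions.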

A  further  classification  of  the  article  into  various 
spatial categories, such as title or introductory paragraph,
may be used to add weights to the significance functions.
These weights reflect the amount of information that the
particular word carries by virtue of its position in the
article. The final significance function may then be calculated as

(3.5)  s_f = b_1 b_2 b_3 s(f, r)

where b_1, b_2, b_3 >= 1 and

      b_1 = b_t if the given word occurs in the title,
            1 otherwise;

      b_2 = b_f if the given word occurs in the first paragraph,
            1 otherwise;

      b_3 = b_s if the given word occurs in the summary,
            1 otherwise.

The terms b_t, b_f and b_s are arbitrarily assigned weights
determined  by  experience. 
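
Equation (3.5) can be illustrated with a short sketch. The weight values used as defaults below are arbitrary placeholders, since the text states only that the weights are assigned from experience; the function and parameter names are likewise assumptions.

```python
def weighted_significance(s: float, in_title: bool, in_first_paragraph: bool,
                          in_summary: bool, title_weight: float = 3.0,
                          first_paragraph_weight: float = 2.0,
                          summary_weight: float = 2.0) -> float:
    """Multiply a significance factor s(f, r) by the positional weights."""
    b1 = title_weight if in_title else 1.0
    b2 = first_paragraph_weight if in_first_paragraph else 1.0
    b3 = summary_weight if in_summary else 1.0
    return b1 * b2 * b3 * s


# A word with s(f, r) = 0.5 that appears in the title and in the summary.
boosted = weighted_significance(0.5, in_title=True,
                                in_first_paragraph=False, in_summary=True)

# The same word appearing only in the body of the article is left unchanged.
unboosted = weighted_significance(0.5, False, False, False)
```

Since every weight is at least one, position can only raise the significance of a word, never lower it.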

An  experiment  was  conducted  by  Damereau  (1965)  on  the 
criteria  suggested  by  Edmundson  and  Wyllys.  Eight  articles 
on  world  politics  appearing  in  Atlas  magazine  were  indexed 
for testing purposes. The frequencies of occurrence of words
in general use were obtained from approximately one million
words of radio news broadcasts in the field of world politics.
He first indexed a series of articles manually, and then
indexed  them  using  a  Poisson  probability  function.  The 
index  terms  for  the  article  were  chosen  as  those  words 
which  occurred  sufficiently  often  that  the  probability  of 
such  a  frequency  of  occurrence,  in  general  usage,  is  less 
than or equal to 0.0005. In addition, the three functions,
s_1 = f - r, s_2 = f / r and s_3 = f / (f + r) were calculated,
and the lists of words chosen by each function
were compared to those chosen manually as well as by the
Poisson distribution function. The results indicated
that the Poisson criterion minimized both the number of
extra words chosen and the number of index terms missed.
Damereau also pointed out that the approach taken by
Edmundson and Wyllys is very difficult to test and implement
because a universe of terms must be created with
before  the  efficiency  of  the  SARA  system  can  be  judged, 
a  suitable  universe  of  terms  must  be  compiled  which  will 
suit  the  needs  of  the  application,  i.e.,  history. 
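
One way to read the Poisson criterion is sketched below. The text does not state how the Poisson function was parameterized, so the sketch assumes that the expected count of a word is the document length multiplied by its relative frequency in general use, and that a word is chosen when the probability of observing at least the actual count is no greater than 0.0005. The function names are hypothetical.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for a Poisson variable X with mean lam."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def is_index_term(count: int, doc_length: int, general_rate: float,
                  threshold: float = 0.0005) -> bool:
    """Choose a word whose observed count is improbably high in general usage."""
    expected = doc_length * general_rate  # assumed Poisson mean
    return poisson_tail(count, expected) <= threshold


# In a 1000-word article, a word with general relative frequency 0.001
# is expected to occur about once.
chosen = is_index_term(count=7, doc_length=1000, general_rate=0.001)
rejected = is_index_term(count=5, doc_length=1000, general_rate=0.001)
```

Under these assumptions seven occurrences are improbable enough for the word to be chosen, whereas five are not.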

A  common  and  valid  criticism,  such  as  that  given  by 
Bourne  (1963),  of  the  techniques  described  above  is  that 
the frequency of occurrence of words is the only criterion
used in choosing index terms. No consideration (except,
perhaps, the prepositional phrase method of Baxendale) is
given to the syntax, order, or grouping of the words.
There are difficulties, however, in determining how significance
should be attached to such information. Word groupings,
as  well  as  their  frequency,  have  been  investigated  by  Oswald 
(see  Edmundson  and  Wyllys  (1961)).  His  treatment  is  an 
extension  of  the  work  of  Luhn  and  Baxendale. 

Tests  must  be  made  to  measure  the  significance  of  the 
index  terms.  Some  criteria,  such  as  comparison  with  a  human 
indexer’s  results,  should  be  satisfied.  This  has  been  done 
in the above experiments by using manually generated indexing
terms in automatic abstracting. A study of the results
given by the SARA system and the reaction of users to these
results could indicate the appropriateness of the index
terms chosen for the system and suggest improvements in the
choice of terms.

Another  concept  that  should  be  considered  is  that  of 
attaching  weights  to  the  indexing  terms.  These  weight 
factors  may  be  determined  by  the  frequency  of  occurrence  of 
the  corresponding  index  term,  and  signify  the  relevance  of 
the  indexing  term  to  the  article.  The  position  in  the 
hierarchical classification system that the word occupies
could also be taken into consideration; the nearer the term
is to the root of the hierarchical tree, the more general it
is,  and  hence  the  less  significant. 




3.2 Automatic Abstracting

The  next  step  in  the  automatic  analysis  of  text  is  that 
generally  referred  to  as  automatic  abstracting  (or,  more 
accurately,  automatic  extraction  of  representative  sentences). 
In  automatic  indexing,  a  significance  function  is  calculated 
for  a  word;  in  automatic  abstracting  a  significance  function 
is  calculated  for  a  sentence.  The  sentences  are  then  ranked 
by  their  significance  factors  and  the  high  ranking  sentences 
are used to form an abstract of the article. The significance
factor attached to each sentence is a function of
the  selected  index  terms.  It  is  a  logical  extension  of 
calculation  of  the  significance  factors  for  individual  words. 

Luhn  (1958)  calculated  significance  factors  of  sentences 
by the following method. Each sentence was scanned to determine
if it was bracketed by previously obtained significant
words. If no more than five non-significant words intervened
between significant words, the grouping was considered
significant.  The  significance  factor  was  then  determined 
by  squaring  the  total  number  of  significant  words  in  the 
resulting  cluster,  and  dividing  by  the  total  number  of 
words  in  the  cluster.  The  highest  ranking  cluster  of  a 
group  of  clusters  in  a  sentence  was  used  as  the  significance 
factor  of  the  sentence. 
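
The cluster calculation described above can be sketched as follows. This is an illustrative reading of the description, not Luhn's program: the function name is hypothetical, and a lone significant word is treated as a cluster of length one.

```python
def sentence_significance(words, significant, max_gap=5):
    """Score a sentence by its best cluster of significant words.

    A cluster begins and ends with a significant word and may contain
    runs of at most max_gap non-significant words. Its score is the
    square of the number of significant words divided by its length.
    """
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # still inside the current cluster
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1  # begin a new cluster
        prev = pos
    return max(best, count * count / (prev - start + 1))


sentence = "the index terms of a document describe its content".split()
score = sentence_significance(
    sentence, {"index", "terms", "document", "content"})
```

In the example the four significant words form a single cluster eight words long, so the sentence scores 16 / 8 = 2.0.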

A limitation to this technique becomes evident if a
sentence has several low ranking clusters. Such a sentence
will not be chosen over a high ranking single clustered
sentence  although  more  information  may  be  contained  in  the 
former.  Edmundson  and  Wyllys  (1961)  suggest  a  combination 
of  both  word  grouping  and  number  of  clusters  within  a 
sentence  as  the  criteria  for  selection  (modified  by 
sentence  length).  They  suggest  a  function,  E,  of  the 
significance  factor  s,  such  that  E(s)  >  1  for  large  s, 
E(s) = 1 for medium s, and E(s) < 1 for small s. The
significance  factor  could  then  be  a  combination  of  the 
occurrence  of  two  significant  words  in  the  text,  their 
position  and  the  number  of  occurrences  within  a  document. 

The  results  of  both  tests  are  quite  encouraging.  Although 
some  of  the  abstracts  lack  continuity,  the  idea  of  the 
article  is  conveyed. 

3.3 Automatic Classification

Automatic classification takes automatic indexing
one step further, and organizes a mass of unrelated data
(index terms) into a hierarchical structure which aids the
computer in handling subsequent retrieval requests. The problem
is one of entering the document under the proper index terms
so that a search request will subsequently retrieve the
relevant document. Again, automatic classification is
determined by the index terms selected by the indexing
procedure.




There  are  two  prevailing  methods  by  which  to  approach 
the  problem  of  assigning  documents  to  a  classification 
system.  One  is  that  a  classification  system  first  be  drawn 
up, and that documents then be assigned to these classifications
(an a priori system). The other is that a system of
classification  be  derived  from  a  set  of  test  documents  and 
that  subsequent  documents  be  classified  with  respect  to 
these  classes. 

A  third  method,  one  which  will  become  more  important 
in the future, was suggested by Becker and Hayes (1963).
The classifications could be automatically reorganized from
time to time, depending on the current uses of the file.
This method implies that statistics be kept on the use of
the system, and that the system be reorganized periodically.

Maron  (1961)  pioneered  some  work  along  the  a  priori 
lines.  His  approach  can  be  divided  into  two  parts.  The 
first part consisted of selecting a set of documents,
in this case 405 abstracts from the IRE Transactions on
Electronic Computers, Volumes 2, 3 and 4 (1959), dealing
with the field of computing. These were divided into two
groups: Group One was used to determine the categories
and index terms while Group Two was used to validate the
results of Group One. Thirty-two categories best describing
these documents were drawn up manually (the classification
scheme), and the documents of Group One were indexed according
to these categories. A group of clue words was chosen
which  best  described  the  documents.  A  category  versus 
clue  words  matrix  was  set  up  by  sorting  the  documents  into 
their  proper  categories.  A  correlation  matrix  was  calculated 
which  gave  the  correlation  of  each  clue  word  to  every  other 
clue  word.  In  the  second  part,  the  documents  of  Group  Two 
were  used  to  test  the  effectiveness  of  this  set  of  clue 
words  and  the  categories.  These  validation  documents  were 
automatically  indexed  by  frequency  of  occurrence  of  index 
terms,  and  their  index  terms  were  used  to  place  the  document 
into the proper classification. The author used a Bayesian
approach to predict to which category a document would
belong by the clue words it contained. The Bayesian prediction
formula was used in the following form:


(3.6)  P(C_j | W_1, W_2, ..., W_n) = [P(C_j) P(W_1 | C_j) ... P(W_n | C_j)] / [P(W_1) P(W_2) ... P(W_n)]

where P(C_j | W_1, W_2, ..., W_n) = probability that a document
                                    belongs to class C_j given
                                    the occurrence of clue words
                                    W_1, ..., W_n;

      P(C_j) = probability of class C_j;

      P(W_i | C_j) = probability of word i occurring in class C_j;

      P(W_i) = probability of word i occurring at all.


The  formula  (3.6)  is  valid  subject  to  the  assumption  that 
the  clue  words  occur  in  a  statistically  independent  manner. 
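
A small numerical sketch of formula (3.6) follows. The two categories, the clue words and all probabilities are invented for illustration and are not taken from Maron's data; the function name `class_scores` is likewise hypothetical.

```python
from math import prod


def class_scores(clue_words, priors, word_given_class, word_prob):
    """Evaluate formula (3.6) for every class, given the observed clue words."""
    denominator = prod(word_prob[w] for w in clue_words)
    return {
        c: priors[c] * prod(word_given_class[c][w] for w in clue_words)
        / denominator
        for c in priors
    }


# Invented probabilities for two categories and two clue words.
priors = {"hardware": 0.5, "software": 0.5}
word_given_class = {
    "hardware": {"circuit": 0.4, "program": 0.1},
    "software": {"circuit": 0.1, "program": 0.4},
}
word_prob = {"circuit": 0.25, "program": 0.25}

scores = class_scores(["circuit"], priors, word_given_class, word_prob)
predicted = max(scores, key=scores.get)
```

A document is assigned to the class with the largest value. When several clue words are used and they are not in fact independent, the values need not sum to one, so they are best treated as a ranking of the classes.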

Maron's  experiment  resulted  in  84.6  percent  of  the 
documents  of  Group  One  being  classified  the  same  as  an 
independent  classification  by  humans,  whereas  the  Group 
Two  classification  corresponded  to  51.8  percent  of  the 
manual  classification.  These  results  demonstrated  that 
the  classifications  were  better  than  chance,  though  not 
yet  as  accurate  as  manual  classification. 

Borko  and  Bernick  (1963  and  1964)  followed  up 
Maron’s  work  by  conducting  more  experiments  on  the  same 
set  of  405  abstracts  and  comparing  results.  Categories 
were generated by applying factor analysis to the correlation
matrix of index terms (not an a priori system).
They contended that a mathematically derived classification
system would be more descriptive of the class of documents
and more amenable to automation than an a priori system.
This system would also be independent of time, i.e., as
more  and  different  index  terms  occur  and  as  new  categories 
arise,  they  may  be  incorporated  into  the  system  without 
difficulty.  In  the  experiment  described  below,  a  manual 
classification  of  the  documents  was  used  as  a  control  group. 

Three  hypotheses  were  tested  as  follows.  First,  by 
using the Bayesian prediction formula on the a priori categories,
a more accurate classification will result than if
a modified factor score resulting from factor analysis is
used. Factor analysis was performed on the 90 index terms
suggested by Maron. Twenty-one categories were derived,
and  these  categories  were  used  to  classify  both  Groups  One 
and  Two.  These  results  indicated  that  the  two  automatic 
classification  systems  agreed  to  a  large  extent  on  both 
Group  One  and  Group  Two,  but  that  the  results  were  not  as 
impressive  when  applied  to  Group  Two  alone.  However,  there 
was  enough  evidence  that  automatic  classification  is  better 
than  random  classification. 

To  test  the  second  hypothesis,  Maron’s  list  of 
descriptors  was  replaced  by  one  compiled  by  Borko.  This 
new  list  of  words  in  Group  One  descriptors  was  derived  by 
a frequency count of the significant words in Group One.
A correlation matrix denoting the correlation between index
terms was factor analyzed into 21 different categories.
Then, the test used with Hypothesis One was applied to
these new index terms and categories. The results relating
to Group One revealed an increase in correct classification
(by manual standards) by both Bayesian and factor score
methods. On Group Two, the Bayesian method correctly classified
55.9 percent. Hence, both methods are approximately
equally  effective  in  classifying  documents. 

The  third  hypothesis  tested  was  that  a  classification 
set derived from factor analysis would result in a larger
percentage of correct classification than an a priori
system.  The  results  verified  that  this  was  true  to  a 
greater  extent  when  applied  to  Group  Two  documents  (the 
important  set)  than  to  Group  One  documents.  The  results 
of  the  experiments  by  Borko  and  Bernick  indicate  that 
automatic  classification  of  certain  types  of  documents  is 
feasible.  The  limiting  factor  is  the  quality  of  the  indexing. 

Doyle  (1965)  published  a  survey  of  present  classifica¬ 
tion  techniques  which  differentiated  between  two  trends  of 
thought,  viz,  automatic  indexing,  which  implies  a  complete 
search  of  all  stored  documents  to  satisfy  a  retrieval 
request,  and  automatic  classification,  which  aids  in 
limiting  the  search  and  in  organizing  the  data.  These  two 
points  of  view  were  reconciled  and  this  discussion  was 
followed  by  an  analysis  of  a  method  to  increase  the  quality 
of  classification.  An  experiment  was  described  which  indi¬ 
cated  that  higher  quality  classification  results  as  more 
information  about  each  document  is  retained.  From  12  to 
36  indexing  terms  were  used  to  describe  each  document. 

The  test  consisted  of  measuring  the  indexing  capabilities 
of  each  set  of  indexing  terms  against  six  human  criteria 
by  varying  the  number  of  index  terms  retained,  beginning 
at  12  and  increasing  to  36.  Four  of  the  criteria  were 
satisfied  as  more  information  was  kept  about  each  document. 

To  increase  the  quality  of  the  classification,  the  amount 




of  information  retained  should  be  increased.  This  increase 
will  also  increase  the  storage  requirements  accordingly. 

Automatic  classification  has  distinct  possibilities. 

If  classification  could  be  mechanized  so  that  75  percent 
accuracy  were  attained  (as  compared  to  the  manual  system) 
then  machine  capabilities  would  approach  those  of  humans. 
Presently, there is insufficient data available to
automatically classify documents in the field of history.

3.4 Other Areas of Computer Analysis of Text

Once information is in machine readable form,
computers can be used to produce a concordance of text
material. A concordance is an alphabetic analysis of
the text, word by word, together with details of the
location in which each word is used. Some of the first
work in the area of computer produced concordances was by
Tasman (1957). A complete concordance of Summa Theologiae
by St. Thomas Aquinas was compiled from 1.6 million punched
cards. The occurrence of each word was summed, and a
printout gave the frequency of occurrence, word usage, and place
of occurrence of each word. Later, Silva and Bellamy (1965)
experimented with a concordance generator. In it, English
and French text could be processed, and a selected
concordance of the text could be produced. Word counts and an
index were also produced.
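
The idea of a concordance can be illustrated with a minimal sketch. It is not the method of Tasman or of Silva and Bellamy; the function name, the word pattern, and the sample text are assumptions made only for illustration. Each word is listed alphabetically with its frequency and the places (line number, word position) where it occurs.

```python
import re
from collections import defaultdict

def build_concordance(lines):
    """Map each word to (frequency, [(line_number, word_position), ...])."""
    locations = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Assumed word pattern: runs of letters and apostrophes.
        words = re.findall(r"[A-Za-z']+", line)
        for pos, word in enumerate(words, start=1):
            locations[word.lower()].append((line_no, pos))
    # Alphabetic order, as in a printed concordance.
    return {w: (len(locs), locs) for w, locs in sorted(locations.items())}

# Hypothetical sample text.
text = ["The cat sat", "the dog sat down"]
concordance = build_concordance(text)

for word, (count, places) in concordance.items():
    print(word, count, places)
```

Folding every word to lower case mirrors the upper case typing limitation mentioned in this chapter; a generator that preserved case would simply omit that step.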




Linguists are finding many uses of, and variations
on, the computer produced concordance. It may well be that
the next step will be completely generalized, machine
independent concordance generators free of the upper case
typing limitation of present day computers.

3.5 Discussion

Once  text  material  is  in  machine  readable  form,  the 
application  of  a  computer  to  index  and  classify  documents 
automatically  seems  not  only  feasible,  but  highly  desirable. 
Although  computer  indexing  may  not  be  as  accurate  as  manual 
indexing,  computers  have  the  advantage  of  consistency,  and, 
to  a  large  extent,  of  predictability.  Much  more  must  be 
done  in  this  area  if  indexing  and  classification  are  to 
keep  abreast  of  the  present  information  explosion  that  is 
occurring in both scientific and non-scientific literature.

There  are  many  reliable  papers  and  books  on  the  subject 
of  automatic  analysis  of  text.  Walston  (1965)  gives  a  wide 
survey  of  the  methods  used  to  date;  in  text  form,  Becker  and 
Hayes  (1963)  present  the  librarians’  point  of  view.  Bourne 
(1963)  has  a  very  readable  survey  of  present  methods  together 
with an excellent bibliography. Vickery (1965) presents the
material well in a more sophisticated and formal way than
Bourne.




CHAPTER  IV 


OPERATIONAL  INFORMATION  STORAGE  AND 
RETRIEVAL  SYSTEMS 


4.0 Introduction

Many  information  storage  and  retrieval  systems  are 
presently  implemented  on  computers.  A  selection  of  these, 
representative  of  the  current  state  of  information  systems, 
will  be  reviewed  in  this  chapter. 

The systems fall into two categories: those
implemented in a batch-processing environment, and those designed
for a real-time time-shared environment. In most
applications of the first category, two functions are to be
performed: to aid in the publication of abstract and
announcement journals, and to do retrospective searching.

The  real-time  systems  are  used  primarily  to  satisfy  a 
request  for  information.  Other  ancillary  functions,  such 
as  the  Selective  Dissemination  of  Information,  can  be 
implemented  on  these  systems. 

The  batch-processing  systems  to  be  reviewed  are  the 
PICUPS  system  at  the  U.S.  National  Agricultural  Library, 
the  MEDLARS  system  at  the  U.S.  National  Library  of  Medicine, 
and the HAYSTAQ system at the U.S. Patent Office. The
real-time systems are the CONVERSE system at Lockheed Missile
and Space Company, the TIP project at Massachusetts




Institute of Technology, the SMART system at Harvard
University, and the BOLD system at System Development
Corporation.

4.1 Batch-Processing Systems

The three systems to be reviewed are typical of
the implementation and aims of batch-processing
information systems. They function as a mechanical aid to
production of publications as well as to retrospective
searching. However, the HAYSTAQ system is used exclusively
for retrospective searching.

It  is  not  yet  economical  to  dedicate  an  expensive 
computer  system  exclusively  to  the  searching  of  literature. 
Hence,  most  current  systems  are  justified  economically  on 
their  added  benefits,  such  as  production  of  publications. 
Input  to  the  batch-processing  systems  is  through  a  card 
or  paper  tape  reader,  and  output  is  to  magnetic  tape  (for 
offset  printing  or  typesetting)  or  to  a  printer. 

4.1.1  The  PICUPS  System 

The  PICUPS  (Pesticides  Information  Center;  Update, 
Publication,  Search)  system  developed  for  the  National 
Agricultural  Library  by  Datatrol  Corporation  (1965)  is 
designed  with  two  purposes  in  mind:  retrospective  searches 
to  satisfy  inquiries,  and  the  publication  of  announcement 
journals.  It  is  designed  with  a  maximum  of  flexibility  to 




allow  changes  and  modifications  to  the  system  with  a  minimum 
of  disruption.  The  PICUPS  system  required  an  estimated  470 
man-days  of  programming  time.  The  time  spent  from  initial 
feasibility  investigation  to  completion  totaled  15  months. 

The  documents  to  be  stored  in  the  PICUPS  system  are 
indexed  to  two  levels  by  a  professional  staff:  the  first 
level  is  for  publication  and  the  second  is  for  retrospective 
searching.

The  indexed  documents  are  typed  on  a  form  and  manually 
edited,  and  the  corrected  forms  are  converted  to  magnetic 
tape  by  an  optical  scanner.  The  vocabulary  in  the  PICUPS 
system  is  well  controlled,  with  generic  structuring  and 
many  cross  references.  New  terms  are  added  as  the  need 
arises.  The  vocabulary  file  can  be  updated  easily  with 
additions  and  deletions  of  terms  and  cross  references. 

There are two files of bibliographic references and
associated descriptors. The Issue File contains all the
recently indexed issues of journals and articles. It is
less voluminous than the Master File, which contains all
the bibliographic data ever processed by the system.
Periodically, the Issue File is used to update the Master
File.

Information  for  the  publication  of  announcement 
journals  is  retrieved  from  the  Issue  File.  The  file  is 
scanned,  and  a  magnetic  tape  is  produced  containing  all 




the  information  necessary  for  the  publication  of  the 
announcement  journals.  The  magnetic  tape  is  in  a  format 
suitable  for  input  to  the  Linotron,  a  typesetting  machine. 
Hence,  all  publications  are  of  high  quality  print,  with  a 
wide  variation  in  type  fonts. 

Retrospective  search  queries  use  weighted  index  terms 
connected  by  Boolean  operators.  Requests  are  screened  by 
one  out  of  eight  professional  staff  members;  they  are  then 
mechanically  expanded  to  include  generically  lower  terms 
and  cross  reference  terms.  The  Master  File  is  searched 
for  document  titles  that  satisfy  the  requests  and  they 
are  output  on  magnetic  tape  for  offline  printing  on  the 
computer’s  printer. 
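
The mechanical expansion of a request term can be sketched as follows. This is only an illustration under stated assumptions: the source does not describe the PICUPS vocabulary file format, so the two dictionaries, the sample pesticide terms, and the decision to follow cross references transitively are all hypothetical.

```python
def expand_term(term, narrower, cross_refs):
    """Return the term plus its generically lower terms and cross references."""
    expanded = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # already visited; also guards against cyclic references
        expanded.add(current)
        stack.extend(narrower.get(current, []))
        stack.extend(cross_refs.get(current, []))
    return expanded

# Hypothetical vocabulary: generic structure and cross references.
narrower = {"insecticide": ["DDT", "malathion"]}
cross_refs = {"DDT": ["dichlorodiphenyltrichloroethane"]}

expanded = expand_term("insecticide", narrower, cross_refs)
print(sorted(expanded))
```

A request for the generic term would then be matched against the Master File using every term in the expanded set.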

The  PICUPS  system  is  a  standard  information  storage 
and  retrieval  system;  it  uses  magnetic  tapes  as  storage, 
and  scans  the  entire  file  in  response  to  a  request.  User 
feedback  to  the  system  is  almost  nonexistent,  the  vocabulary 
is well controlled, and the search strategies are
straightforward and uncomplicated.

4.1.2  The  MEDLARS  System 

The  MEDLARS  information  storage  and  retrieval  system 
(General  Electric  Company  (1963))  has  two  functions.  It 
is  used  for  retroactive  bibliographic  searches  on  stored 
documents  and  it  produces  many  of  the  publications  of  the 




National Library of Medicine, Washington, D.C. Index
Medicus, Cumulated Index Medicus and the List of Journals
Indexed are but a few of these publications. Preliminary
investigation  and  design  of  the  system  required  three 
months;  implementation  required  an  additional  two  years. 

The data base of the MEDLARS system consists of
journal articles and published monographs, both English
and foreign, which are well indexed by a staff of
professional indexers. This indexed information is then
transferred to paper tape and a typewritten sheet by a
Friden Flexowriter, and the data are edited manually.

In particular, consistency and thoroughness in indexing
are checked. The paper tape is then edited by the computer,
and unit records, one for each article or monograph indexed,
are output on magnetic tape. This tape file is subsequently
used for updating the Master File of unit records, and also
for publication of Index Medicus. A set of magnetic tape
files exists which contains all information necessary to
verify indexing terms as well as journal titles, etc.,
used for the publications.

The edited unit record file is sorted by subject
heading, and reformatted to a form compatible with an
off-line printing device called GRACE (GRaphic Arts Composing
Equipment). GRACE is a system which produces justified
photographic copies of pages suitable for offset printing.




The publications produced by GRACE, such as Index Medicus,
have high quality print, many type fonts, and full
justification.

The entire Master File of unit records may be used
in a retrospective search. A search request, submitted
as an expository English paragraph, is coded in a form
compatible with the computer by an information specialist
familiar with the MEDLARS system, its operation, and the
index terms. This request is then batched with
other requests, and the entire unit record file is searched.
However, instead of attempting to satisfy every search in
its entirety on the first pass, a second file of unit
records, much less voluminous and more easily manipulated
than the entire Master File, is produced by the initial
scan. This screened file is then rescanned, and all
documents satisfying the requests are output on a magnetic tape
suitable for either off-line printing on the computer’s
printer or for offset printing on GRACE.

The  search  request  is  formulated  as  a  set  of  coded 
index  terms  connected  by  Boolean  operators.  For  example, 

R:  (M1 + M2) * (M3 + M4)

can be used to retrieve documents containing coded index
terms (M1 and M2) or (M3 and M4); + corresponds to "and"
and * corresponds to "or".
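
A minimal sketch of how such a request might be evaluated against a unit record follows. It adopts the convention stated above (+ for "and", * for "or"); the list-of-groups representation, the function name, and the sample records are assumptions for illustration and are not taken from the MEDLARS programs.

```python
def satisfies(request, record_terms):
    """True if the record satisfies the request.

    request: list of groups; terms within a group are joined by '+' ("and"),
             and the groups themselves are joined by '*' ("or").
    record_terms: set of coded index terms assigned to one unit record.
    """
    return any(all(term in record_terms for term in group) for group in request)

# R: (M1 + M2) * (M3 + M4)
request = [["M1", "M2"], ["M3", "M4"]]

# Hypothetical unit records and their index terms.
records = {
    "d1": {"M1", "M2"},
    "d2": {"M1", "M3"},
    "d3": {"M3", "M4", "M7"},
}

hits = sorted(name for name, terms in records.items() if satisfies(request, terms))
print(hits)
```

Record d2 is rejected because it holds one term from each group but completes neither group.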




Like most current commercial information systems, the
data base is not restricted to retrospective searches.
The publication of an abstracts journal of recent
literature such as Index Medicus is often the prime objective
of such a system, rather than the retrospective searches.
MEDLARS is designed to increase the quality of Index
Medicus, as well as speed its publication. Retrospective
searches are a secondary result of having machine readable
documents. Also resulting from the mechanized system is a
large body of medical information stored in a machine
readable form and suitable for analysis by organizations
other than the National Library of Medicine. An article
in the Journal of Data Management (1966) gives an example
of the use of unit records received from the MEDLARS
system in schizophrenia literature.

4.1.3  The  HAYSTAQ  System 

The HAYSTAQ system, developed by the U.S. Patent
Office (Marden (1965)), is used exclusively for
retrospective searches on chemical information indexed by
chemical structures. Before issuing a patent on any
discovery, a retrospective search on all previous patents
issued must be carried out to determine if the article to
be patented is unique. One of the most active fields in
patenting is that of chemistry. In order to keep abreast




of its responsibilities, the U.S. Patent Office has developed
a mechanized approach to the problem. The result is a
computer program called HAYSTAQ. No figures are available on
the time required to design and implement the HAYSTAQ system.

The HAYSTAQ system depends upon the matching of
chemical structures; no other information, such as process
or physical properties, is used in indexing the file,
although this type of information will be used in future
versions. All patents on chemical information are coded
by a team of professional chemists. Each indexer analyzes
the chemical information, draws the structural diagrams of
all chemical compounds used in the claim and then codes
these structural diagrams in a form acceptable to the
computer. This information is punched on paper tape with a
Flexowriter, machine edited, and then coded and compressed
onto a magnetic tape file. This file, which uses the coded
chemical structures as indexing terms, is used for
retrospective searching.

In  order  to  produce  a  workable  system,  chemical 
structures  must  first  be  separated  into  functional  groups 
by  the  search  program.  For  example,  a  complex  compound  such 
as 3-phenyl propylamine,




can be separated into the functional groups of a phenyl
ring (C6H5 -), a three-carbon chain (- CH2 - CH2 - CH2 -),
and - NH2. Any of these groups can then be
used as an index term in a retrospective search. However,
the indexer codes the entire structure; the program separates
all compounds into their constituent parts.

Every functional group is assigned a unique code. For
example, the functional group oxy, of the form = O, is coded
3C0FA, and the functional group thio, of the form X - S - X,
is coded 3C0C8. By indicating connections between functional
groups, in the same way as semantic links and roles are used
to connect English words to form a concept, any chemical
compound can be uniquely represented. It also can be coded
directly from the structural diagram. Thus, a representation
of the structure diagrammed previously may be as shown below,
where the functional codes are arbitrarily assigned by the
author.


      From Link    Coded Functional Group    To Link

1.                 3C02A                     2(1)
2.    1(1)         3C1C1                     3(2)
3.    2(2)         3C72A
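
The three coded entries above can be held as linked records, as in the following sketch. The record layout and the consistency check are assumptions made for illustration; the source gives only the codes and link fields, not the HAYSTAQ tape format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Link = Tuple[int, int]  # (number of the linked record, type of link)

@dataclass
class GroupRecord:
    number: int
    code: str                    # coded functional group
    from_link: Optional[Link]    # link arriving from the previous group
    to_link: Optional[Link]      # link leaving to the next group

# The three entries of the table, transcribed as records.
records = [
    GroupRecord(1, "3C02A", None, (2, 1)),
    GroupRecord(2, "3C1C1", (1, 1), (3, 2)),
    GroupRecord(3, "3C72A", (2, 2), None),
]

def links_consistent(recs):
    """Check that every To Link is mirrored by the From Link of its target."""
    by_number = {r.number: r for r in recs}
    for r in recs:
        if r.to_link is None:
            continue
        target_number, link_type = r.to_link
        target = by_number.get(target_number)
        if target is None or target.from_link != (r.number, link_type):
            return False
    return True

print(links_consistent(records))
```

A check of this kind is one plausible part of the machine editing step, since a To Link without a matching From Link would indicate a coding error.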



Each  link  field  contains  the  type  of  link  joining  the  two 
functional  groups  in  parentheses. 

Another  generic  extension  of  the  coding  scheme  used 
by  HAYSTAQ  is  the  Markush  function.  Instead  of  specifying 
a  functional  group  at  each  link  in  a  structural  diagram, 
a  set  of  functional  groups  may  be  specified.  For  example, 
a  compound  may  be  represented  by 

R1 - O - R2

where

R1:  NH3 -
     H - O -

and

R2:  - O - H
     - H

Any combination of R1 and R2 functional groups may be
used  in  the  original  structure.  By  using  this  coding 
scheme,  many  variations  of  similar  chemical  compounds  may 
be  represented  efficiently. 
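
The economy of the Markush function can be seen by enumerating the structures that one coded entry stands for. The sketch below is illustrative only; the string representation and the function name are assumptions, and the group alternatives are copied as printed above.

```python
from itertools import product

# Alternatives for each position, as listed in the text.
r1_options = ["NH3 -", "H - O -"]
r2_options = ["- O - H", "- H"]

def expand_markush(r1, r2, core="O"):
    """List every structure R1 - O - R2 covered by one Markush entry."""
    return [f"{left} {core} {right}" for left, right in product(r1, r2)]

variants = expand_markush(r1_options, r2_options)
for v in variants:
    print(v)
```

Two alternatives at each of two positions give four structures from a single coded entry; in general the number covered is the product of the numbers of alternatives.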

Requests  are  coded  in  the  same  way  as  the  chemical 
information.  Chemical  structures  are  expanded  and  coded, 
and  the  entire  tape  file  is  scanned.  Each  entry  in  the 
file  is  matched  with  the  request  by  topologically  matching 
the  request  and  each  item  in  the  file.  If  a  complete  match 




is  found,  it  is  output  as  satisfying  the  request.  However, 
screening techniques are used to reduce the number of
topological matchings. For example, the functional groups are
divided  into  two  generic  classes.  If  the  more  generic  of 
these  classes  does  not  contain  all  the  functional  groups 
of  the  request,  no  further  matching  is  done;  if  it  does, 
matching  continues. 
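
The purpose of screening can be sketched as a cheap containment test applied before the expensive comparison. This is a simplification under stated assumptions: the two generic classes are not modelled, the screen simply requires every functional group of the request to occur in the file entry, and the topological matcher is a stub supplied by the caller. The entry names and codes are hypothetical.

```python
def screen(request_codes, entry_codes):
    """Cheap test: every functional group in the request occurs in the entry."""
    return set(request_codes) <= set(entry_codes)

def search(request_codes, file_entries, topological_match):
    """Scan the whole file, matching topologically only entries that pass the screen."""
    hits = []
    for name, entry_codes in file_entries.items():
        if not screen(request_codes, entry_codes):
            continue  # no further matching is done for this entry
        if topological_match(request_codes, entry_codes):
            hits.append(name)
    return hits

# Stub standing in for the real topological match; it records each call.
calls = []
def stub_match(request_codes, entry_codes):
    calls.append(tuple(entry_codes))
    return True

entries = {
    "patent_A": ["3C02A", "3C1C1", "3C72A"],
    "patent_B": ["3C02A", "3C0C8"],
}
hits = search(["3C02A", "3C72A"], entries, stub_match)
print(hits, len(calls))
```

Only one of the two entries reaches the matcher, which is the saving the screening step is intended to provide.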

This  system  is  used  exclusively  for  retrospective 
searching.  Many  of  the  difficulties  encountered  when 
dealing  with  natural  language  documents  are  not  encountered 
with  this  system.  A  request  is  either  completely  satisfied 
or  it  is  not.  Semantic  analysis  of  the  meaning  of  words  is 
not  necessary.  A  well  structured  and  fully  defined  language 
is  used  (the  structural  formulae  of  compounds)  which  has  no 
inherent  ambiguities.  Hence,  conversion  to  a  computer  is 
relatively  easy. 

4.2 Real-Time, Time-Shared Systems

The second portion of this chapter deals with
real-time information retrieval systems designed for time-shared
computers. Some authors, for example Licklider (1965) and
Swanson (1964), feel that future information storage and
retrieval systems will rely heavily on the concept of
time-sharing. Future systems will require "immediate" response
and direct communication with the stored information. Since,




in  most  cases,  a  user  is  not  entirely  aware  of  what  he 
wants,  immediate  feedback  to  and  from  the  system  will  be 
required  in  order  to  narrow  down  the  request  to  a  form 
manageable  for  a  machine  search.  All  of  these  requirements 
imply  a  need  for  a  real-time  system.  However,  a  real-time 
system  cannot  be  justified  economically  unless  it  is  shared 
among  several  active  users.  Thus,  we  have  the  concept  of 
time-sharing.

Four real-time information retrieval systems will be
discussed: CONVERSE, TIP, SMART, and BOLD. SMART and BOLD
are experimental systems, of which only the BOLD system is
implemented on a time-shared computer.

4.2.1  The  CONVERSE  System 

This system at Lockheed Missile and Space Company
(Drew, Summit, Tanaka and Whitely (1965)) is designed
around a time-shared computer with a card reader input
device and a teletypewriter output unit. The data base
consists of two magnetic tape files produced from over
8000 machine readable master catalogue cards containing
all the descriptive information of a wide variety of
documents. Four people required six weeks to design, implement
and test the system after machine readable documents for
testing the system were available. This amounts to 120
man-days.




In  order  to  use  the  system  a  user  selects  the  group 
of  descriptors  best  describing  the  articles  desired.  A 
file  of  punched  cards  containing  all  descriptors  in  the 
system  stands  beside  the  input  terminal.  There  is  also  a 
card  reader  capable  of  reading  a  single  manually  entered 
card  at  a  time.  The  desired  descriptor  cards  are  selected 
from  this  file  and  arranged,  along  with  control  cards,  in 
a  deck  to  be  read  by  the  terminal.  After  reading  all  of 
the  selected  cards,  the  system  outputs  on  a  teletypewriter 
the documents that satisfy the request. If three or fewer
documents are found, all the bibliographic data are printed.
If  four  to  15  are  found,  the  document  reference  number  is 
output.  If  more  than  15  are  found,  a  count  of  the  number 
of  documents  is  output.  The  user  can  then  refine  and 
rephrase  his  query  and  input  the  request  again. 
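
The graded response described above can be summarized in a short sketch. The thresholds (three and 15) are from the text; the function name, the tuple layout, the message wording, and the handling of an empty result are assumptions for illustration.

```python
def format_response(matches):
    """Choose the level of detail printed on the teletypewriter.

    matches: list of (reference_number, bibliographic_data) tuples.
    """
    count = len(matches)
    if count == 0:
        return ["NO DOCUMENTS FOUND"]           # assumed; not described in the text
    if count <= 3:
        return [data for _, data in matches]    # full bibliographic data
    if count <= 15:
        return [ref for ref, _ in matches]      # reference numbers only
    return [f"{count} DOCUMENTS FOUND"]         # count only

# Hypothetical matches.
few = [("R1", "Title one"), ("R2", "Title two")]
some = [(f"R{i}", f"Title {i}") for i in range(4)]
many = [(f"R{i}", f"Title {i}") for i in range(16)]

print(format_response(few))
print(format_response(some))
print(format_response(many))
```

Printing less as the result grows keeps the slow teletypewriter from being tied up by a poorly specified request, and prompts the user to refine the query.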

Two magnetic tape files are used for retrieval. The
first file consists of all descriptive information
appearing as output. The second file, created from the master
catalogue cards, is an inverted file, i.e., each descriptor
is followed by a list of documents using that descriptor.
Descriptors corresponding to all categories are identified,
grouped, sorted alphabetically within categories, and
manually edited. Non-significant terms are purged, and
synonyms are grouped together. The descriptors are then
punched on cards and sequenced for future retrieval

quona  art}  e}09jt*s  T9 eu  e  m9ls^«  9.13  aeu  o 3  isbio  nl 


b«n9Xrt9  vXIsjjnBfli  slgnie  fi  *io  *Xcf£qB9  isbAo*  b^rso  ^ 


al  i&dmisn  9onerr9'l9T  dnamubob  etid  tbruK>*l  9*is  5X  ot  *xiiol  II 
medtmjn  9tid  Jo  dnuoo  &  <bnuol  b'ib  51  nscid  siorcr  II 

.  1 1 ; '  ]  j  v  •■>  at  s  3 1  ’  t-  n  u  c  o  b  I  b 


'isdasm  srid  mo*!!  bsdBSTO  ^11**  bnooss  arfT  .Jvqduo  SB  anl 


■ 

•  ' 

*  ...  .  .  •■«. 


' 


.bsnunsbl  ^b  esi'iogedfio  C Ib  od  snJtbnoqesnoo  aiodqlioesa 

■  < 


— ruB.r!  brie  t39l'iog9JB0  n.  cijlv  YllBoiitsdfiriqlB  JS>9d*ioe  .bsq.LO'is 
briB  <&98rijjq  svb  ermsd  dnBolJ Ingle-ci oW  .bsdtbs  ^IXbu 

V  , 


54 


requests.  This  file  is  searched  in  response  to  requests. 
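
An inverted file of the kind described can be sketched as follows. The catalogue contents and function names are hypothetical, and the assumption that a request requires every selected descriptor is the writer's, since the text does not state how the selected descriptor cards are combined.

```python
from collections import defaultdict

def build_inverted_file(catalogue):
    """Invert {document: descriptors} into {descriptor: set of documents}."""
    inverted = defaultdict(set)
    for doc_no, descriptors in catalogue.items():
        for descriptor in descriptors:
            inverted[descriptor].add(doc_no)
    return dict(inverted)

def lookup(inverted, descriptors):
    """Documents indexed under every selected descriptor (assumed 'and' logic)."""
    postings = [inverted.get(d, set()) for d in descriptors]
    return set.intersection(*postings) if postings else set()

# Hypothetical master catalogue cards.
catalogue = {
    101: ["ROCKETS", "GUIDANCE"],
    102: ["ROCKETS"],
    103: ["GUIDANCE", "TELEMETRY"],
}
inverted = build_inverted_file(catalogue)
result = lookup(inverted, ["ROCKETS", "GUIDANCE"])
print(sorted(result))
```

Because each descriptor leads directly to its documents, a request touches only the lists for the selected descriptors rather than every document record.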

Future design improvements include the use of typewriters
as input devices, and a "free" vocabulary, i.e.,
the  user  is  not  constrained  to  the  vocabulary  of  the 
machine.  All  user  vocabulary  will  be  translated  into  a 
form  compatible  with  the  system,  and  a  search  on  the 
translated  terms  will  take  place.  Also,  attempts  will 
be  made  to  decrease  the  amount  of  data  to  be  searched  by 
means  of  a  screening  process  designed  to  retrieve  a  set 
of  documents  containing  the  documents  requested. 

Although  this  system  is  crude  in  its  manner  of 
operation  compared  to  the  systems  to  be  described,  it 
is  one  of  the  first  operational  information  retrieval 
systems  employing  a  real-time,  time-shared  approach. 

4.2.2  The  Technical  Information  Project 

The  Technical  Information  Project  at  Massachusetts 
Institute  of  Technology  (Kessler  (1965a))  is  one  of  the 
few  information  retrieval  systems  which  is  operational 
on  a  time-shared  machine.  It  is  designed  around  the 
experimental Project MAC complex, a real-time, time-shared
system with remote teletypewriter input and output
units. Figures are not available on the time required
to complete the TIP project.

The body of literature to be searched (journals
concerned with physics) is relatively limited and of a
technical nature. The article title and author(s), together
with the bibliographic information of the article, such
as journal title or source, volume number and page number,
are keypunched onto cards. These input data are edited
and placed on a disc for access by the computer. No
abstracting, reviewing, or editing of the source material
is done. Title words and bibliographic data are relied
upon for retrieval of the articles. In this way, no
costly data preparation is necessary; however, a great
deal of the useful information is lost.

The  system  consists  of  three  parts.  The  SEARCH 
command  specifies  the  range  of  journals  to  be  searched, 
such  as  all  of  the  journals,  or  only  the  most  recent  issues 
of  a  journal.  The  FIND  command  determines  what  elements 
are  to  be  searched  for,  such  as  author  name,  a  specific 
word  in  a  title,  a  particular  citation,  or  the  location 
of the author, e.g., M.I.T. The third command, OUTPUT,
specifies what is done with the results, i.e., whether they
are stored, printed, or counted.
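
The division of labour among the three commands can be sketched as follows. This is only an illustration in Python, not the TIP implementation: the record layout, the function names and arguments, and the sample articles are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical record layout; the real TIP card format is not given.
    journal: str
    volume: int
    title: str
    authors: list
    citations: list = field(default_factory=list)


def search(articles, journal=None, since_volume=0):
    """SEARCH: select the range of journals (and issues) to be scanned."""
    return [a for a in articles
            if (journal is None or a.journal == journal)
            and a.volume >= since_volume]


def find(articles, title_word=None, author=None):
    """FIND: keep the articles matching every element that was specified."""
    hits = []
    for a in articles:
        words = a.title.lower().split()
        if title_word is not None and title_word.lower() not in words:
            continue
        if author is not None and author not in a.authors:
            continue
        hits.append(a)
    return hits


def output(articles, mode="print"):
    """OUTPUT: count, print, or store (here simply return) the results."""
    if mode == "count":
        return len(articles)
    if mode == "print":
        for a in articles:
            print(f"{a.journal} {a.volume}: {a.title}")
    return articles


# Invented sample file, for illustration only.
FILE = [
    Article("Phys. Rev.", 130, "Plasma waves in metals", ["Smith"]),
    Article("Phys. Rev.", 140, "Spin waves", ["Jones"]),
    Article("J. Appl. Phys.", 36, "Plasma diagnostics", ["Smith", "Lee"]),
]
```

A request is then a pipeline of the three steps, for example `output(find(search(FILE), title_word="plasma"), mode="count")`.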

Since each article does not have an elaborate set
of index terms associated with it, there are only two
useful methods of retrieving relevant material: by
specifying words which may appear in the title, or by
locating articles which share a common element such as
author or citation. In order to facilitate this, a set
of routines under the name SHARE has been written. The
most helpful shared elements are the citations.
This type of sharing is termed "bibliographic coupling".
Earlier publications (Kessler (1963a and 1963b)) discuss
the use of this criterion to classify documents into
related groups. If two documents cite a common reference,
it is assumed that they deal with related material. Tests
were performed (Kessler (1965b)) which indicate that, in
a narrow technical field, bibliographic coupling is a
reasonable method of linking documents.

In order to carry out a full search, a preliminary
search may be made using a word contained in the title.
It returns a few documents which may deal with the desired
topic. By then requesting a search for articles which
share citations with these documents, most relevant
material may be retrieved. However, at least two complete
scans of the file of documents are necessary. No attempt
is made to classify items to decrease the amount of material
searched.
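
The two-scan procedure just described may be sketched as below. The sketch is in Python and rests on assumptions: the file is represented as a dictionary from a numeric article name to a (title, citations) pair, and the function names and sample data are invented for the illustration; they are not the SHARE routines themselves.

```python
def title_search(file, word):
    """First scan: articles whose title contains the given word."""
    word = word.lower()
    return {doc_id for doc_id, (title, _) in file.items()
            if word in title.lower().split()}


def coupled(file, seeds):
    """Second scan: articles sharing at least one citation with a seed."""
    seed_cites = set()
    for doc_id in seeds:
        seed_cites |= set(file[doc_id][1])
    return {doc_id for doc_id, (_, cites) in file.items()
            if doc_id not in seeds and seed_cites & set(cites)}


# Invented sample file: article number -> (title, cited references).
FILE = {
    1: ("Plasma oscillations", ["R1", "R2"]),
    2: ("Electron density waves", ["R2", "R3"]),
    3: ("Crystal growth", ["R9"]),
    4: ("Collective modes", ["R3"]),
}

seeds = title_search(FILE, "plasma")   # preliminary title-word search
related = coupled(FILE, seeds)         # articles coupled to the seeds
```

Each call scans the whole file once, which is why at least two complete scans are needed. Repeating `coupled` on the enlarged set reaches articles that share no citation with the original seeds.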

The  concept  of  bibliographic  coupling  also  aids  in 
browsing.  By  making  a  preliminary  search  for  a  title  word 
the  user  can  limit  the  amount  of  material  scanned.  By  then 
requesting  articles  that  share  citations,  he  can  thread  his 
way  through  most  of  the  material  in  a  narrow  field. 




Bibliographic coupling is well suited to mechanization.
Since each article is assigned a unique numeric
name,  searches  are  easily  implemented.  No  problems  arise 
concerning  the  meaning  of  words.  On  the  other  hand,  much 
of  the  information  that  could  be  coded  is  lost  by  retaining 
only  the  identifying  and  bibliographic  data.  Any  paper 
that  deals  with  a  new  subject  has  very  few  citations  to 
link  it  to  related  material.  The  bibliographic  coupling 
approach  appears  to  be  satisfactory  for  a  very  narrow  field, 
such  as  subfields  of  physics,  but  it  is  unsatisfactory  for 
application  to  a  broad  spectrum  of  knowledge. 

The Technical Information Project has met with
enthusiastic use at M.I.T. Brown (1966) describes a project
which is well suited to the system. In order to
update his publication Basic Data of Plasma with more recent
data, a search is made of all entries which cite relevant
articles, and the user is notified. These articles are
then reviewed and new data are added to his book.

An area which is just beginning to be explored, and
which is a direct result of time-shared computing, is that
of using the computer as a communication device. The
Selective Dissemination of Information then becomes a minor
function of the system. As an article is entered into the
system, all user interest profiles are scanned to determine
whether the article is of interest to any user. If so, it is
stored for future printout, perhaps on the terminal itself at
the request of the user. Alternatively, a message may be
printed in a form suitable for mailing to the user. Other
users' files or interest profiles may be scanned to determine
whether they are interested in the same field as the
searcher. Time-sharing is introducing an entirely new
concept into information retrieval and it has a great deal
of potential value. However, much research is still needed
to determine the best ways of utilizing the new capabilities.

4.2.3  The  SMART  System 

The SMART information retrieval system (Salton (1964)
and Salton and Lesk (1965)) is a computerized system which
is designed to take full advantage of user interaction with
the machine. It is an iterative system which allows the
user to specify a search request, analyze the output, and
repeatedly respecify his request or the search mode until
his needs are fulfilled. It is an experimental system
designed for, but not yet implemented on, a time-shared
computer. It can also be used as a vehicle for testing
various storage, analysis, and search strategies on the
full text of documents. The SMART system is the result of
an estimated two to three years' work. Accurate estimates
are difficult since the system is continually evolving.

The system is designed such that, by repeated
specification of many variables, the user retains a great
deal of control over the system. After determining which
analysis procedures (which will be discussed below) are
desired, the user formulates a search request in full
English sentences, with no prior coding. His request,
along with the full text of all documents in the collection,
is analyzed according to the specified analysis
procedures, and is correlated with the documents; the
highly correlated documents are returned to him. If the
results are unsatisfactory, the user can either reformulate
his request, or change the mode of analysis, or both,
and request a further search. This procedure continues until
the user is satisfied.

The analysis system consists of a supervisor, called
CHIEF, which can call on various processing subroutines.
CHIEF can accept eight input instructions, which specify the
type of processing to be done, and also 35 control options
for the various processing instructions. These instructions
control the type of analysis to be carried out on the document.

The entire text of the document is input to the system
with no prior editing or indexing. Since no processing is
done on the input data, the analysis procedures, which have
been reviewed (Salton (1963)) and evaluated (Salton (1965)),
are the heart of the SMART system. The entire structure of
the system depends on these procedures. A request for a
document can be considered as nothing more than another
document, to be analyzed by the same procedures used to
analyze the text of the documents.

The alphabetic dictionary lookup procedure is an
essential step in the analysis and consists of normalizing
the vocabulary used in the text. Every word is scanned,
and the high-frequency words of low information content, such
as "a", "the", etc., are discarded. The remaining words
are separated into stems and affixes by matching against a
prestored thesaurus of possible words with corresponding
codes. Every stem is replaced by a concept code number,
and a syntactic code is assigned for every stem and affix.
Variations in word spellings due to the addition of affixes,
such as pluralizing by changing "y" to "i" and adding "es",
are allowed for. If a word cannot be located in the
thesaurus, a notification is sent to the user, and that word
is disregarded in further processing. Synonymous words are
assigned the same concept number.
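
The lookup step may be illustrated by the following Python sketch. The stop-word list, the thesaurus entries, the concept numbers and the suffix list are all invented for the example, and the syntactic codes are omitted; it is a simplified sketch, not the SMART procedure itself.

```python
# All tables below are hypothetical examples, not the SMART thesaurus.
STOP_WORDS = {"a", "the", "of", "and", "in", "is"}
# Stem -> concept number; synonymous stems share one number.
THESAURUS = {"comput": 101, "calculat": 101, "librar": 205, "retriev": 310}
# Longer suffixes are tried first; "ies" covers the "y" -> "i" + "es" plural.
SUFFIXES = ("ation", "ies", "ing", "ers", "er", "es", "al", "s", "y", "e")


def lookup(word):
    """Return (concept number, suffix), or None if the word is not found."""
    word = word.lower()
    if word in THESAURUS:
        return THESAURUS[word], ""
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in THESAURUS:
            return THESAURUS[stem], suffix
    return None


def normalize(text):
    """Replace each content word by its concept number.

    Returns (concepts, unknown): unknown words would be reported to the
    user and are disregarded in further processing.
    """
    concepts, unknown = [], []
    for raw in text.split():
        word = raw.strip(".,;:\"'").lower()
        if not word or word in STOP_WORDS:
            continue
        hit = lookup(word)
        if hit is None:
            unknown.append(word)
        else:
            concepts.append(hit[0])
    return concepts, unknown
```

With these tables, "library" and "libraries" both reduce to the stem "librar" and so receive the same concept number, while "computers" and "calculating" are matched as synonyms.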

At  this  point,  all  words  in  the  text  have  been  analyzed 
and  transformed  into  a  numeric  form  suitable  for  mechanical 
manipulation.  The  dependency  of  the  subsequent  procedures 
upon  vocabulary  is  thus  decreased,  and  synonyms  are  matched; 
also,  the  meaning  of  specific  words  is  broadened  through 
synonymous  concept  numbers. 

In the procedures to be described it is sometimes
desirable to retain the original input text words. To
facilitate this, a so-called "vacuous" dictionary can be
constructed by assigning dummy concept numbers to every
new word encountered in the text. Thus, the procedures
described above can be used for analysis, and every word
in the input text can retain its uniqueness.

An optional method for expanding a document's concepts
is by consultation of a hierarchy of concepts, which can be
thought of as replacing a library's classification system.
The hierarchy consists of a treelike structure representing
relationships between concepts, with each node corresponding
to a concept number. Though not stated explicitly, it
appears that the hierarchy is constructed manually. As one
moves up the tree, a generically superior (father) concept
can be referenced, and if one moves down the tree, a generically
inferior (son) concept can be obtained. Those concepts
on the same level (brothers) can be accessed, and concepts
can be cross referenced, i.e., one can enter at another
related node from anywhere in the tree (see Figure 4.2.1).
List processing techniques are used to process this tree as
well as a tree of concept numbers, which are used as entry
points to any node in the tree.
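
A minimal sketch of such a hierarchy, in Python, is given below. The class and method names and the concept numbers are invented for the illustration. In place of the second tree of concept numbers mentioned above, the sketch uses a dictionary as the entry-point index; this is a substitution made for brevity, not the SMART design.

```python
class ConceptNode:
    """One node of the concept hierarchy (hypothetical layout)."""

    def __init__(self, number, parent=None):
        self.number = number
        self.parent = parent
        self.children = []
        self.cross_refs = []      # links to related nodes elsewhere in the tree
        if parent is not None:
            parent.children.append(self)

    def father(self):
        """Generically superior concept, or None at the root."""
        return self.parent

    def sons(self):
        """Generically inferior concepts."""
        return list(self.children)

    def brothers(self):
        """Concepts on the same level under the same father."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


def build_index(root):
    """Map each concept number to its node, so any node can be entered directly."""
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        index[node.number] = node
        stack.extend(node.children)
    return index


# Invented example tree.
root = ConceptNode(1)
a = ConceptNode(10, root)
b = ConceptNode(20, root)
a1 = ConceptNode(11, a)
a2 = ConceptNode(12, a)
a1.cross_refs.append(b)           # cross reference from concept 11 to 20
index = build_index(root)
```

Expanding a concept then amounts to looking up its node in `index` and collecting the numbers of its father, sons, brothers, or cross references as required.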

Figure 4.2.1
Concept Hierarchy
(diagram not reproduced; the legend identifies cross reference links)

From the alphabetic analysis, we have the concept
numbers associated with each sentence. By sorting on concept
number, the frequency of occurrence of each concept number,
along with its corresponding sentence number, can be calculated.
From this, a concept-sentence incidence matrix can
be constructed (see Figure 4.2.2), with element ij equal
to n if and only if concept i occurs exactly n times
in sentence j. To calculate a measure of similarity between
concepts, every row of this matrix can be correlated
with every other row by one of three measures:


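(4.1)  Cosine

\[ C_{ab} = \frac{\sum_{i=1}^{m} a_i b_i}{\sqrt{\sum_{i=1}^{m} a_i^{2} \, \sum_{i=1}^{m} b_i^{2}}} \]

(4.2)  Overlap

\[ C_{ab} = \frac{\sum_{i=1}^{m} \min(a_i, b_i)}{\min\left( \sum_{i=1}^{m} a_i, \; \sum_{i=1}^{m} b_i \right)} \]

(4.3)  Asymmetric

\[ C_{ab} = \frac{\sum_{i=1}^{m} \min(a_i, b_i)}{\sum_{i=1}^{m} a_i}, \qquad C_{ba} = \frac{\sum_{i=1}^{m} \min(a_i, b_i)}{\sum_{i=1}^{m} b_i} \]

Here a_i and b_i denote the elements of the two rows being
compared, and m is the number of sentences (columns).
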


                     SENTENCE NUMBER

                   1     2     3    ...    m

CONCEPT     1      0     3     2    ...    0
NUMBER      2      4     1     0    ...    1
            3      0     0     2    ...    7
           ...
            p      6     2     0    ...    0

Figure 4.2.2
Concept Sentence Matrix



If  we  compute  the  correlation  of  all  concept  pairs, 
a  concept-concept  matrix  of  correlation  measures  can  be 
constructed.  This  measures  the  strength  of  association 
between  two  concepts  within  a  sentence. 
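
The construction of the incidence matrix and the three correlation measures (4.1) to (4.3) may be sketched in Python as follows. The function names are invented, zero denominators are mapped to 0.0 by assumption, and the sample matrix takes the four rows and four columns printed in Figure 4.2.2 as a complete matrix purely for illustration.

```python
import math


def incidence_matrix(sentences, concepts):
    """Element [i][j] = number of times concept i occurs in sentence j."""
    return [[s.count(c) for s in sentences] for c in concepts]


def cosine(a, b):
    """Measure (4.1)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def overlap(a, b):
    """Measure (4.2)."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = min(sum(a), sum(b))
    return num / den if den else 0.0


def asymmetric(a, b):
    """Measure (4.3); asymmetric(a, b) and asymmetric(b, a) generally differ."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(a)
    return num / den if den else 0.0


def concept_concept(matrix, measure):
    """Correlate every row (concept) with every other row."""
    n = len(matrix)
    return [[measure(matrix[i], matrix[j]) for j in range(n)]
            for i in range(n)]


# Rows and columns as printed in Figure 4.2.2 (illustrative use only).
MATRIX = [
    [0, 3, 2, 0],
    [4, 1, 0, 1],
    [0, 0, 2, 7],
    [6, 2, 0, 0],
]
```

For the first two rows, the only sentence in which both concepts occur contributes min(3, 1) = 1, so the overlap measure is 1/5 and the two asymmetric values are 1/5 and 1/6.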

There are two reasons for measuring concurrence over
the range of a sentence. First, a group of concepts which
correlate highly within a sentence are, in all probability, a
single concept and can be replaced by a single concept
number. Second, sentences can be ranked by significance, and
only the high ranking sentences need be used for further
analysis; alternatively, they can be used to produce an
autoabstract, in a manner similar to the approach used by
Luhn (1958).

There  are  two  methods  of  producing  a  single  concept 
from  two  or  more  highly  correlated  concepts.  Firstly,  a 
dictionary  of  statistical  phrases,  consisting  of  concept 
pairs  with  no  semantic  coupling  can  be  consulted,  and  the 
new  concept  number  can  replace  all  occurrences  of  the  old 
concept  numbers  in  each  sentence.  Secondly,  concepts  can 
be  clustered.  A  single  concept  can  be  chosen,  and  a  second 
concept  which  correlates  highly  with  it  can  be  added  to  the 
cluster.  A  third  concept  which  correlates  highly  with  both 
terms  can  be  added  to  the  cluster,  and  so  on.  This  cluster 
can  then  be  replaced  by  a  single  concept  number. 
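
The second method, clustering, may be sketched as below. The sketch is in Python; the threshold that decides what "correlates highly" means, the function names, and the sample correlation matrix are assumptions introduced for the illustration.

```python
def grow_cluster(seed, corr, threshold):
    """Start from one concept and add each further concept that
    correlates at least `threshold` with every current member."""
    cluster = [seed]
    for candidate in range(len(corr)):
        if candidate in cluster:
            continue
        if all(corr[candidate][m] >= threshold for m in cluster):
            cluster.append(candidate)
    return cluster


def replace_cluster(sentence_concepts, cluster, new_number):
    """Replace every member of the cluster by a single concept number."""
    return [new_number if c in cluster else c for c in sentence_concepts]


# Invented concept-concept correlation matrix for four concepts 0..3.
CORR = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.9, 0.2],
    [0.7, 0.9, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
```

With a threshold of 0.6, starting from concept 0, concepts 1 and 2 join the cluster in turn, while concept 3 is rejected because it correlates weakly with every member.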

Syntactic processing permits a refinement of reduced
requests and documents by retaining some of the syntactic
relationships of the document. The significant sentences
of both the documents and requests, as described above,
may be used for processing. With this mode of analysis,
terms or concepts may be clustered if and only if the
syntactic relationships are identical. A subroutine has
been designed to input the stems and affixes of the significant
sentences, and to output the sentences in a syntactic
tree form. This tree is then compared to a prestored tree,
called the "criterion phrases" dictionary, which contains
information about the syntactic relationships between concepts.
If a sentence, or part of a sentence, matches the
criterion phrases tree, it can be expanded by the addition
of further concepts.

Up to this point, documents have been characterized
by concept numbers, which may have been expanded by hierarchical,
syntactical phrase or criterion phrase analysis, and
by a concept-sentence concurrence matrix. The analysis may
be supplemented further by constructing a concept-document
matrix, similar to the concept-sentence matrix, except that
the entire document, not just a sentence, is used. This
matrix can be subjected to row correlation methods, as was
the concept-sentence matrix, and the concept clusters
determined from highly correlated concepts. Any concept in
the cluster may then be replaced by a single concept
representing the entire cluster.

By using this matrix, and correlating its columns,
certain documents and prior requests may be clustered.
Documents which then correlate highly with the request
vector are returned as satisfying the request. The request
vector may have its terms weighted by the user, in addition
to being expanded by the above analysis procedures. Thus,
the user may retain control over the terms used in the
search. The user can vary four variables which affect the
search. He can specify the correlation measure used, he
can change the number of documents output by changing the
correlation threshold value, and he can vary the kinds of
documents output either by varying the analysis procedures
used or by respecifying the search request.
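
The final matching step may be sketched in Python as follows. The representation of requests and documents as equal-length lists of concept frequencies, the default threshold, the function names and the sample vectors are all assumptions made for the illustration; the cosine measure is used as the default, with any other measure substitutable.

```python
import math


def cosine(a, b):
    """Cosine correlation of two concept vectors (0.0 if either is empty)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def retrieve(request, documents, threshold=0.5, weights=None, measure=cosine):
    """Return identifiers of documents correlating at least `threshold`
    with the request vector, best first.

    weights: optional user-supplied weight for each term of the request.
    """
    if weights:
        request = [r * w for r, w in zip(request, weights)]
    scored = [(measure(request, vec), doc_id)
              for doc_id, vec in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score >= threshold]


# Invented concept vectors over three concepts.
DOCS = {"d1": [2, 0, 1], "d2": [0, 3, 0], "d3": [1, 1, 1]}
```

Raising the threshold shortens the list returned: for the request `[1, 0, 1]`, a threshold of 0.5 returns d1 and d3, while 0.9 returns d1 alone.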

The SMART system is one of the more advanced experimental
information storage and retrieval systems in operation
today. The identification of combinations of analysis
procedures that result in relevant documents being returned,
and the discovery of optimum methods of dealing with
man-machine interaction, should result from this experiment.

4.2.4  The  BOLD  System 

The BOLD (Bibliographic On-Line Display) system described
by Burnaugh (1966) is a real-time, on-line experimental system
which is designed as a vehicle for research as well as
information storage and retrieval. It is operational on a
time-shared machine which uses teletypewriters and cathode ray
tubes (CRT) as terminal devices. The use of CRT devices
introduces new possibilities into the information storage
and retrieval cycle. Since the time required to output
a page of information on the CRT is very small, it can be
used as an interrogation device. Upon finding information
which is required in hard copy form (printed on paper) for
future reference, the user can have the information on the
CRT transferred to the teletypewriters or to magnetic tape
for off-line listing. Estimates of development and
implementation times are not available for the BOLD system.

The BOLD system is divided into two sections: retrieval
and data base generation. The data base generator creates a
data base from input data, which can be in almost any form.
In the illustration discussed, the data base consists of
bibliographic items containing information, referred to as
"descriptors", such as title, author, accession number, and
the indexing terms of a document. Each entry contains
terms subordinate to the descriptors which describe the
entry, e.g., *author / Jones, R.T. //. A table of these
terms is constructed, with a single address pointing to a
data entry using this descriptor. Within this data entry
there is an address pointing to another data entry using
the same descriptor and so on, until the last data entry
points to the first data entry, thus forming a loop. Every
document must be manually indexed: the data base generator
is only a convenient method for storing bibliographic
items.
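
The looped chain of addresses may be sketched in Python as follows. List indices stand in for storage addresses, and the entry layout, function names and sample entries are assumptions made for the illustration, not the BOLD data base format.

```python
def build_chains(entries):
    """Build the term table and the per-entry links.

    entries: list of dicts mapping descriptor -> term.
    table[(descriptor, term)] is the index of the first entry using the term;
    links[i][(descriptor, term)] is the index of the next entry on the loop.
    """
    table = {}
    links = [dict() for _ in entries]
    last = {}
    for i, entry in enumerate(entries):
        for key in entry.items():          # key is a (descriptor, term) pair
            if key not in table:
                table[key] = i
            else:
                links[last[key]][key] = i
            last[key] = i
    for key, i in last.items():
        links[i][key] = table[key]         # last entry points back to first
    return table, links


def follow(table, links, key):
    """Walk one loop and return every entry index using the given term."""
    first = table.get(key)
    if first is None:
        return []
    found, i = [first], links[first][key]
    while i != first:
        found.append(i)
        i = links[i][key]
    return found


# Invented sample entries.
ENTRIES = [
    {"author": "Jones, R.T.", "title": "Heaters"},
    {"author": "Smith, A.", "title": "Lighting"},
    {"author": "Jones, R.T.", "title": "Lighting"},
]
table, links = build_chains(ENTRIES)
```

Only one address per term is held in the table; every other entry using the term is reached by following the loop until the starting entry recurs.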

In order to transform a natural language request
into the language utilized by the data base in the BOLD
system, a dictionary of authorized terms is constructed.
The basic entity in the dictionary is the descriptor, under
which all entries are defined. This dictionary consists of
a hierarchical structure that represents a classification
of every term in the data base. The user has a high degree
of control over this dictionary; he can add, delete and
replace terms and change the relationship of terms already
existing in the system to suit his own needs. Every user
can interrogate the dictionary to learn the words authorized
for a search, and he can then form a suitable request. For
example, to locate all words of the root HEAT the user would
type .HEAT . The system would then return a count of the
entries under all root forms of HEAT and output information
similar to the following:

6  entries  are  ref’d  by  heat 

1  entries  are  ref’d  by  heaters 

2  entries  are  ref’d  by  heating. 

Other interrogation instructions to the dictionary,
for example requests for equivalent retrieval terms, can be
used. After determining terms which are authorized for the
system, the user may formulate a request by connecting
retrieval terms by Boolean operators, e.g.,

.HEATERS and LIGHTING, or

.*AUTHOR = JOHNSON, R. D. and *TITLE = HEATERS.

In order to retrieve documents, a man-machine dialogue
is initiated. The system begins by listing all categories
of the data base on the CRT. Using a light pen, the user
selects a category, to which the system responds by flashing
all subcategories. The user continues to work his way
through a classification tree until retrieval terms which
can be used in a search request are presented. He then
selects one of two modes, Search or Browse. In the Search
mode, a request is formulated by use of terms plus Boolean
operators, and all document identifiers satisfying the
request, ranked by the number of retrieval terms each
document contains, are returned. The user may then delete
any document, and the next document in the list is presented.
The abstract of any document may be requested at any time to
determine if the article is, in fact, relevant.
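
The Search mode may be illustrated by the following Python sketch, which evaluates a Boolean request against an inverted index and ranks the result by the number of retrieval terms each document contains. The index layout, the function names and the sample data are assumptions made for the example; the BOLD request language itself is not modelled.

```python
# Hypothetical inverted index: retrieval term -> set of document identifiers.
INDEX = {
    "HEATERS": {1, 2, 5},
    "LIGHTING": {2, 3},
    "HEATING": {2, 5},
}


def search_and(index, terms):
    """Documents containing every term (Boolean AND)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, terms):
    """Documents containing at least one term (Boolean OR)."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result


def rank(index, docs, terms):
    """Order documents by the number of retrieval terms each contains."""
    def count(d):
        return sum(d in index.get(t, set()) for t in terms)
    return sorted(docs, key=lambda d: (-count(d), d))
```

For example, the request HEATERS and LIGHTING corresponds to `search_and(INDEX, ["HEATERS", "LIGHTING"])`, while an OR request over all three terms, passed through `rank`, presents the document containing the most terms first.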

In Browse mode, any descriptor may be used. The browse
may be limited to specific categories, or may range over the
entire data base. The user can browse his way through
documents retrieved by this method, and inspect interesting
abstracts at will. Terms in a request can be deleted, and
output can be transferred from the CRT to an off-line device.
The dictionary items can be manipulated by adding, deleting,
and replacing terms as well as changing relationships
between existing terms.

The  design  of  the  BOLD  system  allows  use  of  many 
dictionaries  biased  in  favor  of  each  user,  and  all  relating 
to  a  single  data  base.  Thus,  it  is  possible  for  a  user  to 
design  his  own  basis  of  communication  with  the  machine  by 
an  individual  vocabulary  and  classification  scheme. 




CHAPTER  V 


THE  SARA  SYSTEM 


5.0  Introduction

This chapter will describe an information storage and
retrieval system developed on, and for, a remote-access,
real-time, time-shared computing system. Although the
system is intended primarily for applications in the field
of history, it could also be applied to other fields. The
system, called SARA (Storage And Retrieval Alberta), has
been developed with two objectives in mind. First, it
should be sufficiently general to be of use in more than
a single discipline; second, for reasons of efficiency, the
use to which the system is to be put should be taken into
account. For example, a general method for dealing with index
terms is used. If an index term has been entered in the
thesaurus it may be used as an index term; there are no
restrictions on the length of the term, the meaning of the
term, and so on. On the other hand, an efficient and less
general method for handling dates is incorporated into the
system because of the dependence of history on dates.

Special attention is paid to storage methods and
information manipulation. Peripheral devices are not available
because of present restrictions on the programming language.
However, an efficient method of storing large masses of data
on a random access device is simulated.



The selection of index terms for the documents was
carried out in collaboration with the Department of History.
Guidelines were laid out to indicate the approach to be taken
in selecting the index terms in order that the selection
remain compatible with the computer. Appendix A discusses the
approach taken and some of the problems encountered.

A  brief  description  of  the  programming  language  used 
and  the  environment  in  which  the  system  operates  introduces 
the  next  section.  A  description  of  the  SARA  system  from  the 
user's  point  of  view  and  of  the  internal  workings  of  the 
system  concludes  the  chapter. 

5.1  A General Description of the SARA System

The  SARA  system  consists  of  two  sections:  the  selection 
and  indexing  of  the  documents,  and  the  mechanized  storage  and 
retrieval  part  which  manipulates  the  indexed  documents.  The 
first  section  is  under  the  control  of  the  history  participant, 
and  all  decisions  concerning  the  choice  of  index  terms,  their 
interrelations  and  the  documents  to  be  indexed  are  his.  The 
author had little control over these decisions; the only
control retained was that concerning compatibility of index terms
to  the  computer.  Appendix  A  gives  a  more  complete  description 
of  this  part  of  the  SARA  system. 

The second part of SARA deals with the storage and
retrieval of the indexed documents. It is written in a new
programming language, called APL (see Iverson (1962) or
Falkoff and Iverson (1966)) for time-shared computer
hardware. Following is a full description of the mechanized
portion.

5.1.1  The  Hardware  Environment  and  the  Programming  Language 
The  system  is  designed  around  an  IBM  System  360  Model 
67  computer  with  IBM  2741  remote  access  terminals.  This  is 
a large computing system capable of servicing many users
concurrently. Hardware of this type will be vital to the
efficient operation of future information storage and retrieval
systems.  Many  users  may  utilize  the  hardware  by  searching 
a  common  data  base  with  different  search  requests,  and, 
hence,  share  the  cost  of  operation  of  the  system.  The  BOLD 
system  already  described  foreshadows  such  use.  The  system 
utilizes  only  the  hardware  mentioned  above,  since  no  input 
or  output  commands  exist  at  present  for  peripheral  devices. 
All communication between the machine and the user is via
the  terminals.  It  is  understood  that  this  restriction  will 
be  removed  shortly.  Input  and  output  commands  may,  for 
example,  refer  to  magnetic  tape  or  discs.  The  APL  system, 
used  to  develop  this  portion  of  the  SARA  system,  utilizes 
the  equipment  mentioned  above,  as  well  as  some  large  capacity 
random  access  storage  such  as  disc  or  drum.  Such  random 
access  devices  will  also  be  necessary  for  any  large  scale 
information  retrieval  system. 




APL is specified so that it is oriented toward on-line
communications between man and machine. It is well suited
to man-machine communication since its set of operators is
very precise and very powerful. This power, however, is
restricted primarily to its arithmetical and logical
capabilities. For example, arrays can be handled simply and
easily. The notation C ← A +.× B multiplies two matrices
of suitable dimension, A and B, and stores the result in C.
Other capabilities necessary for efficient general purpose
operation of the computer, such as input and output commands
for the use of magnetic tape, disc or cards, are notably
lacking.
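
For readers unfamiliar with the notation, the inner product
C ← A +.× B can be spelled out as below. This is a minimal
Python sketch and not part of SARA; the function name
inner_product and the sample matrices are illustrative
assumptions.

```python
from functools import reduce
from operator import add, mul


def inner_product(a, b, f=add, g=mul):
    """Generalized inner product in the spirit of APL's  A f.g B.

    With f = add and g = mul this is ordinary matrix multiplication.
    a is an m-by-k list of rows, b is a k-by-n list of rows.
    """
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")
    columns = list(zip(*b))  # the columns of b
    return [[reduce(f, (g(x, y) for x, y in zip(row, col)))
             for col in columns]
            for row in a]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = inner_product(A, B)  # plays the role of  C ← A +.× B
print(C)
```

Passing other functions for f and g gives the other inner
products that the notation allows, which is part of what makes
the single expression so compact.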

The programmed portion of the SARA system is divided
into three subsystems: the control subsystem, the storage
subsystem, and the retrieval subsystem (Figure 5.1.1).
These are described separately. Block diagrams and listings
of all routines in these subsystems appear in Appendix B,
and an example of the use of the system appears in Appendix C.

5.1.2  The  Control  Subsystem 

The control subsystem, SARACNL, is a program whose
function is to interpret commands and pass control to the
appropriate subsystem. Control is returned to the subsystem
SARACNL upon exit from the storage and retrieval subsystems,
and control is ultimately returned to the APL system from
this subsystem.




[Diagram: the control program SARACNL passes control to the
storage subsystem (STORE) and to the retrieval subsystem
(FIND and PRINT).]

Figure 5.1.1
SARACNL Program Hierarchy




The user signs on to the APL system in the usual manner
and requests the control system by typing SARACNL. A full
description of the operation of the APL system is given by
Falkoff and Iverson (1966). In this subsystem, and in all
subsystems to be described, the user has the option of
obtaining a full explanation of the operation of the
subsystem in control immediately upon entering the system.
This option may be overridden upon entry to the subsystem
by typing NO after the command, e.g., FIND NO. After typing
GO, SARACNL pauses for an input command: STORE, FIND,
PRINT or END. The STORE command initiates the storage
subsystem, and FIND and PRINT are commands of the retrieval
subsystem. After a command is entered, control is relinquished
to the proper subsystem, and processing of the function
entered begins. END passes control back to the APL system.

5.1.3  The  Storage  Subsystem 

The storage subsystem, accessed by typing STORE when
the main program, SARACNL, is in control, stores documents
on the simulated peripheral equipment. Where applicable,
terms are edited and added to the proper files.

Once entered, the storage subsystem types a
system-assigned document number, followed by TITLE.
The user enters the title of the document, which is placed
in the simulated peripheral file PERMEM. AUTHOR is then
requested. The author must be input in the form
SURNAME-INITIALS. The name is edited. If it has been used
previously, the document number is stored in its cell in
the matrix MEM; otherwise, the surname is added to the
list of index terms, and a new entry is created in the
main file, MEM. JOURNAL is then typed. The input expected
is of the form JOURNALNAME, VOL. XX. The name of the
journal must be in the list of authorized terms, LST;
otherwise, the journal is re-requested. The volume number
may be omitted if desired. The word XTERMS follows. One
index term at a time may be entered. Each is edited to
determine if it is an authorized term. If it is not, then
notification is given to the user. Index terms may continue
to be entered until the user types END. As each term is
entered, the current document number is stored in that
index term's cell. YEARS appears after the user types END.
The years the document deals with may be entered in one of
four formats as indicated by the following examples:

1) 1920
2) 1920,1922
3) 1920,1922,1930
4) 1920-1930.

The document number is entered in the years file, YMEM. If
years are not applicable, any character other than a digit
may be entered and no action will be taken. The final entry
for the document is the ABSTRACT. If applicable, an abstract
may be entered. MORE? is typed, and the user replies with
YES or NO; if NO, control is returned to SARACNL.
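
The four YEARS formats can be recognized with very little
logic. The following Python sketch is illustrative only: the
function name parse_years is hypothetical, and expanding a
range such as 1920-1930 into every year it spans is an
assumption, since the text does not say how a range is held
in YMEM.

```python
def parse_years(entry):
    """Return the list of years named by a YEARS entry.

    Accepted forms (from the text): '1920', '1920,1922',
    '1920,1922,1930' and '1920-1930'.  An entry that does not
    begin with a digit means "not applicable" and gives [].
    """
    entry = entry.strip()
    if not entry or not entry[0].isdigit():
        return []  # no action is taken
    if "-" in entry:
        first, last = (int(part) for part in entry.split("-"))
        if first > last:
            raise ValueError("a range must run from earlier to later")
        # Assumption: a range stands for every year it spans.
        return list(range(first, last + 1))
    return [int(part) for part in entry.split(",")]


print(parse_years("1920,1922,1930"))
print(parse_years("1920-1923"))
```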

The  storage  subsystem  is  simple,  but  it  serves  the 
needs  of  this  system.  When  input  and  output  commands  become 
available to APL, a different storage subsystem will be
required. It would be oriented to the peripheral equipment
used to store the documents, and to the input and output
commands.

5.1.4 The Retrieval Subsystem

The retrieval subsystem has two commands available to
it: FIND and PRINT. These two commands will be discussed
individually.

The  FIND  command  initiates  a  subsystem  which  asks  if 
the  request  will  have  weighted  terms,  to  which  the  user 
replies  YES  or  NO.  Each  index  term  in  the  system  may  have 
a  numeric  factor  associated  with  it.  This  numeric  factor, 
or  weight,  determines  the  relative  importance  of  each  term. 
For example, if index terms ECONOMICS, RELIGION and POLITICS
have weights 12, 23 and 31, respectively, more importance
would be associated with documents indexed by the term
POLITICS than with the other two terms. Thus, weighting is
a method of assigning relative importance to index terms in
a request. For a full explanation of weighting and its use,
see Brandhorst (1966). Continuing with the description
of the FIND command, the subsystem awaits a request to be
entered. This request can consist of index terms, or years,
or both, connected by one of the Boolean operators ∨, ∧,
<, ≤, =, ≠, ≥ or >. For example, a request may consist of
ECONOMICS ∨ RELIGION ∧ (PER ≥ 1921). This requests documents
dealing with economics or religion on or after the year 1921.
The only index term which can be used with a year is PER
(period). For example, ECONOMICS ≥ 1921 is invalid; PER ≥
1921 is valid. The request is edited to ensure that
all terms are valid index terms; if they are not, the user
is notified and the request is re-entered by the user. If the
search terms are to be weighted, the user is notified as
each term requires weighting, e.g., WEIGHTS? - ECONOMICS;
he inputs a weighting factor for each index term, a
non-negative integer less than 100, e.g., 62. After all index
terms have been weighted, the user is asked to specify the
type of expansion required on the request. It may be
expanded up (to include more general terms, one for each
request term) by typing GENERAL; down (to include more
specific terms) by typing SPECIFIC; across (to include
related terms) by typing RELATED; or have no expansion at
all by typing NONE. Figure A.3 in Appendix A depicts, in
graphic form, the hierarchy of terms used.

At this point, the system is ready to initiate a search of
the data. The user is asked if a search should be initiated.
If YES, a search on the stored data begins; if NO, the user
is asked to state his request again. After the entire file
of documents has been searched, the user is notified of the
number of documents satisfying the search, and is asked if
the document numbers should be listed. If YES, five document
numbers are typed, and he is asked if more should be output,
and so on, until all document numbers are listed. The user
is given the option of expanding the same request in another
way and re-searching the entire file (by typing SAME), or
rephrasing his request (by typing YES), or exiting the
subsystem (by typing END or NO). If he exits from the
subsystem, he can enter the PRINT subsystem to inspect the
documents retrieved.

The  PRINT  subsystem  lists,  in  natural  language,  any 
or  all  of  the  descriptive  parts  of  the  specified  documents. 
Any  set  of  the  title,  author,  journal,  index  terms  (including 
the  years  the  document  deals  with)  and  the  abstract  can  be 
inspected.  The  document  to  be  inspected  is  identified  by  its 
document  number.  For  example,  to  inspect  the  title  and 
abstract  of  document  2613,  the  user  would  type 


2613,TITLE,ABSTRACT




Since  only  the  first  two  characters  of  each  option  are 
inspected,  the  user  could  also  type 

2613,TI,AB

and  the  same  information  would  be  output.  The  system  would 
list  the  title  and  abstract,  and  ask  if  more  documents  are 
to  be  processed.  If  the  user  requires  that  all  parts  of 
the document be typed, he enters 2613,ALL. If no more
documents  are  to  be  processed,  the  user  may  exit  from  the 
subsystem  by  typing  END.  He  is  then  in  a  position  to 
request  more  documents,  or  to  relinquish  control  to  the 
APL  system  by  typing  END  again. 
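
The two-character rule for PRINT options can be sketched as
follows. This Python fragment is an illustration, not the
SARA routine: only TITLE, ABSTRACT and ALL are named in the
text, so the option names AUTHOR, JOURNAL and XTERMS, as well
as the function name parse_print_request, are assumptions.

```python
# TITLE and ABSTRACT appear in the text; the other names are assumed.
OPTIONS = ("TITLE", "AUTHOR", "JOURNAL", "XTERMS", "ABSTRACT")
BY_PREFIX = {name[:2]: name for name in OPTIONS}


def parse_print_request(line):
    """Split e.g. '2613,TI,AB' into (2613, ['TITLE', 'ABSTRACT'])."""
    fields = [field.strip().upper() for field in line.split(",")]
    document_number = int(fields[0])
    wanted = []
    for field in fields[1:]:
        key = field[:2]  # only the first two characters are inspected
        if key == "AL":  # ALL selects every part of the document
            return document_number, list(OPTIONS)
        if key not in BY_PREFIX:
            raise ValueError("unknown option: " + field)
        if BY_PREFIX[key] not in wanted:
            wanted.append(BY_PREFIX[key])
    return document_number, wanted


print(parse_print_request("2613,TITLE,ABSTRACT"))
print(parse_print_request("2613,TI,AB"))
```

Both calls name the same document parts, which is the
behaviour described above for the abbreviated form.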

The  user  has  a  large  degree  of  control  over  the 
system.  He  has  options  available  to  him  as  he  formulates 
his  request,  as  well  as  a  natural  language  interface  with 
the  machine.  If  a  particular  request  does  not  satisfy  his 
needs,  he  can  rephrase  the  request,  or  he  can  expand  it  in 
one  of  three  ways,  i.e.,  make  it  more  general,  more  specific 
or  include  related  terms. 

5.2 Details of the Operation of the System

The control subroutine, SARACNL, interprets the four
commands FIND, PRINT, STORE and END, and transfers control
to the appropriate subsystem. If the command is not one of
these four, the command is re-requested by the system
retyping GO. Upon exiting from any one of the three subsystems,
control can be transferred to any other subsystem, or control
can be returned to the APL system by typing END.

The  FIND  subsystem,  a  major  portion  of  the  retrieval 
subsystem,  will  be  illustrated  by  following  the  steps  of 
transforming  a  natural  language  request  to  reverse  Polish 
numeric  notation  used  to  search  the  file  of  documents.  The 
original  natural  language  request  is  transformed  three  times 
before  it  is  in  reverse  Polish  numeric  notation.  The  original 
request  is  first  transformed  into  a  numeric  vector,  or  string; 
each  component  of  this  string  corresponds  to  either  an  index 
term  or  an  operator.  This  numeric  string  is  then  expanded 
to  include  general,  specific  or  related  terms,  if  requested 
by  the  user.  The  third  transformation  converts  the  expanded 
numeric  string  to  reverse  Polish  numeric  notation.  This 
transformation  will  be  explained  in  detail  later.  It  is 
this  numeric  string  that  is  used  in  the  final  step  of  the 
retrieval,  the  search  procedure. 

Initially, the subsystem asks if the terms are to be
weighted by typing EQUAL WEIGHTS?; if YES, all terms have
equal weights of one. It will be assumed in our illustration
that the terms will be weighted. The following request is
entered.

(CHURCH ∨ STATE) ∧ ECONOMICS ∧ (PER ≥ 1650) ∧ (PER ≤ 1900).




This  requests  all  documents  concerning  church  or  state 
economics,  dealing  with  the  period  1650  to  1900,  inclusive. 
The Boolean operators ∧ (and), ∨ (or) and ~ (not), as well
as the relational operators (to deal with years) are
available to the system.

The open parenthesis is first encoded to its numeric
equivalent, -1. The first index term, CHURCH, is checked
to determine if it is an authorized index term by consulting
the thesaurus for valid index terms (LST). If it is, a
unique non-negative numeric code, called the concept number,
corresponding to the index term is determined, and a weight
is requested for the term. This weight is combined with
the concept number by dividing the weight by 10^6, and adding
it to the concept number. Hence, to weight concept number
12 by 68, we would have 12 + (68 / 10^6) = 12.000068. This
is entered in the request vector. The next element, the
∨ (or) operator, is coded to its numeric equivalent, -2.
Note that all operators and parentheses have numeric codes
less than zero. This process continues until the (numeric)
years are encountered. These are divided by 10^4 and
entered in the coded string as quantities between zero
and one. Hence, the year 1921 would have a numeric code
of 0.1921. After coding, the sample request would be
transformed into the numeric vector




-1, 31.000012, -2, 172.000030, -11, -3, 61.000014,
-3, -1, 1.000001, -9, 0.165001, -11, -3, -1, 1.000001,
-6, 0.190001, -11

with the following concept numbers and weights: CHURCH = 31,
weight = 12; STATE = 172, weight = 30; ECONOMICS = 61,
weight = 14; and PER = 1, weight = 1. All operators
(including parentheses) are coded as negative integers, all numeric
quantities, such as years, as fractions and all index terms
as positive integers. Thus, three classes of possible input
can be distinguished in coded form by the range into which
they fall. The weights are added to each concept number as
fractions.
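
The coding step can be reproduced with a short Python sketch.
It is an illustration under stated assumptions, not the APL
routine: the names encode, OPERATOR_CODES, CONCEPTS and
WEIGHTS are hypothetical; only the operator codes that occur
in the sample vector are listed; and the 0.000001 carried by
each coded year in the sample vector is reproduced here by
adding the weight of the preceding term (PER, weight 1), which
is one reading of the sample rather than a rule stated in the
text. Decimal arithmetic is used so that the six-place
fractions stay exact.

```python
import re
from decimal import Decimal

# Codes that occur in the sample vector; other operators are omitted.
OPERATOR_CODES = {"(": -1, "∨": -2, "∧": -3, "≤": -6, "≥": -9, ")": -11}
# Concept numbers and weights of the worked example.
CONCEPTS = {"CHURCH": 31, "STATE": 172, "ECONOMICS": 61, "PER": 1}
WEIGHTS = {"CHURCH": 12, "STATE": 30, "ECONOMICS": 14, "PER": 1}

TOKEN = re.compile(r"[()∨∧≤≥]|[A-Z]+|\d+")


def encode(request):
    """Turn a request string into the coded numeric vector."""
    coded = []
    last_weight = 0
    for token in TOKEN.findall(request):
        if token in OPERATOR_CODES:
            coded.append(Decimal(OPERATOR_CODES[token]))
        elif token.isdigit():
            # A year becomes a fraction between zero and one.
            # Assumption: it also carries the preceding term's weight.
            coded.append(Decimal(token) / 10**4
                         + Decimal(last_weight) / 10**6)
        else:
            if token not in CONCEPTS:
                raise KeyError("not an authorized index term: " + token)
            last_weight = WEIGHTS[token]
            coded.append(CONCEPTS[token] + Decimal(last_weight) / 10**6)
    return coded


REQUEST = "(CHURCH ∨ STATE) ∧ ECONOMICS ∧ (PER ≥ 1650) ∧ (PER ≤ 1900)"
print([str(value) for value in encode(REQUEST)])
```

Because operators are negative, years lie between zero and
one, and index terms are one or greater, a later stage can
tell the three classes apart by range alone, as noted above.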

At this point, the user is asked for the type of
expansion required for the request. One of these four
commands must be input: GENERAL, SPECIFIC, RELATED or
NONE. The request is expanded to include the relevant
terms in an "or" relationship.

A hierarchy, depicted in Figure 5.1.2, is represented
in the computer by a three-dimensional binary array, called
TREE. This hierarchy is set up manually, and is manually
entered in the computer. For the purposes of the present
discussion, TREE will also be represented by two matrices
named A and B. Both A and B are square matrices, with the
number of rows and columns equal to the number of index
terms in the hierarchy. In this example, there are eleven
rows and eleven columns. If index term j implies index
term i, e.g., HISTORY implies CHURCH, element ij of
either matrix A or B contains a one. Otherwise, it
contains a zero. If both elements ij and ji contain a
one, index term i implies index term j, and index term
j implies index term i (see CHURCH and TAXES in Figure
5.1.2). The hierarchy in Figure 5.1.2 depicts two types
of relationships; solid lines represent the direct
relationship between terms and the dotted lines represent cross
reference index terms. Thus, we have two matrices, one
corresponding to each type of relationship. Matrix A
depicts the direct relationships and matrix B depicts the
cross references.

Associated with the array TREE is a pointer vector,
TPT. TPT has the same number of elements as there are
index terms in the system. Each index term has a concept
number which is used to subscript TPT. The corresponding
element in TPT gives the row or column corresponding to
that term. If the element is zero, the index term does not
appear in the hierarchy. Figure 5.1.3 shows the
relationship between concept number, TPT and TREE.

HISTORY (311)
    MILITARY (78)
        ECONOMICS (61)
        DEFENCE (69)
    CHURCH (31)
        CLERGY (48)
        LANDS (3)
        SERVICES (15)
    STATE (172)
        ADMINISTRATION (182)
        TAXES (111)

[Indentation stands for the solid lines of the original
diagram; dashed lines there mark cross-references, such as
the one between CHURCH and TAXES.]

Figure 5.1.2
Expansion Tree


Concepts (LST):  PER POLITICS LANDS ... SERVICES ... CHURCH ...

TPT:  [entries not legible]

TREE:

A:
0 1 1 1 0 0 0 0 0 0 0   (History)
0 0 0 0 1 1 0 0 0 0 0   (Military)
0 0 0 0 0 0 1 1 1 0 0   (Church)
0 0 0 0 0 0 0 0 0 1 1   (State)
0 0 0 0 0 0 0 0 0 0 0   (Economics)
0 0 0 0 0 0 0 0 0 0 0   (Defence)
0 0 0 0 0 0 0 0 0 0 0   (Clergy)
0 0 0 0 0 0 0 0 0 0 0   (Lands)
0 0 0 0 0 0 0 0 0 0 0   (Services)
0 0 0 0 0 0 0 0 0 0 0   (Administration)
0 0 0 0 0 0 0 0 0 0 0   (Taxes)

B:
0 0 0 0 0 0 0 0 0 0 0   (History)
0 0 0 0 0 0 0 0 0 0 0   (Military)
0 0 0 0 0 0 0 0 0 0 1   (Church)
0 0 0 0 0 0 0 1 0 0 0   (State)
0 0 0 0 0 0 0 0 0 0 0   (Economics)
0 0 0 0 0 0 0 0 0 0 0   (Defence)
0 1 0 0 0 0 0 0 0 0 0   (Clergy)
0 0 0 0 0 0 0 0 0 0 0   (Lands)
0 0 0 0 0 0 0 0 0 0 0   (Services)
0 0 0 0 0 0 0 0 0 0 0   (Administration)
0 0 0 0 0 0 0 0 0 0 1   (Taxes)

Figure 5.1.3
The Relationship Between Concept Numbers,
TPT and TREE


Each request can be expanded in one of three ways
(besides having no expansion done on it). In order to
include more general terms in the request (GENERAL), the
column of matrix A corresponding to the index term is
consulted; to include more specific terms (SPECIFIC), the
row of matrix A is consulted. To include related index
terms (RELATED), the column of matrix B is consulted. For
example, to expand CHURCH to include specific terms, we
note that matrix elements A[3;7], A[3;8] and A[3;9] are
referenced. These elements correspond to the terms
CLERGY, LANDS and SERVICES. The concept numbers
corresponding to the index terms are merged with the concept number
corresponding to the original index term in an "or"
relationship. Upon expanding the example string to the more general,
the following string would result if the tree of Figure
5.1.2 were used:

-1, -1, 31.000012, -2, 311.000012, -11, -2, -1, 172.000030,
-2, 311.000030, -11, -11, -3, -1, 61.000014, -2, 78.000014,
-11, -3, -1, 1.000001, -9, 0.165001, -11, -3, -1,
1.000001, -6, 0.190001, -11.

The expansion done on the example request was GENERAL.
Hence, the columns of matrix A were consulted. This
corresponds to tracing a path one level up the heavy line in
the hierarchy of Figure 5.1.2. If, on the other hand, the
expansion had been SPECIFIC, the rows of matrix A would be
consulted, corresponding to tracing a path one level lower
on the heavy lines of the hierarchy. If the expansion had
been RELATED, the columns of matrix B would have been
consulted. This would correspond to following the dotted
lines in the hierarchy. Returning to our example, the
numeric string is a coded form of the natural language
request

((CHURCH ∨ HISTORY) ∨ (STATE ∨ HISTORY)) ∧ (ECONOMICS ∨
MILITARY) ∧ (PER ≥ 1650) ∧ (PER ≤ 1900).
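
The row and column lookups just described can be sketched in
Python. This is an illustration only: TPT is held here as a
dictionary rather than a vector subscripted by concept number,
the helper names matrix and expand are hypothetical, matrix A
is built from the tree of Figure 5.1.2, and the entries of
matrix B are taken from one reading of the listing in Figure
5.1.3 (only the CHURCH and TAXES cross-reference is confirmed
by the text).

```python
# Concept number of the term held in each row of TREE, in row order.
ROWS = [311, 78, 31, 172, 61, 69, 48, 3, 15, 182, 111]
# TPT: concept number -> row (1-based); absent means "not in hierarchy".
TPT = {concept: row for row, concept in enumerate(ROWS, start=1)}
N = len(ROWS)


def matrix(pairs):
    """Binary matrix with a one at [row of i; row of j] for each pair."""
    m = [[0] * N for _ in range(N)]
    for i, j in pairs:
        m[TPT[i] - 1][TPT[j] - 1] = 1
    return m


# Direct relationships (solid lines): general term first.
A = matrix([(311, 78), (311, 31), (311, 172), (78, 61), (78, 69),
            (31, 48), (31, 3), (31, 15), (172, 182), (172, 111)])
# Cross-references (dotted lines), as read from the listing.
B = matrix([(31, 111), (111, 31), (172, 3), (48, 78)])


def expand(concept, kind):
    """Concept numbers to be or-ed with `concept` for one expansion."""
    row = TPT.get(concept, 0)
    if row == 0 or kind == "NONE":
        return []
    k = row - 1
    if kind == "GENERAL":      # column of A
        hits = [i for i in range(N) if A[i][k]]
    elif kind == "SPECIFIC":   # row of A
        hits = [j for j in range(N) if A[k][j]]
    elif kind == "RELATED":    # column of B
        hits = [i for i in range(N) if B[i][k]]
    else:
        raise ValueError("unknown expansion: " + kind)
    return [ROWS[i] for i in hits]


print(expand(31, "GENERAL"), expand(31, "SPECIFIC"), expand(31, "RELATED"))
```

The GENERAL results for CHURCH, STATE and ECONOMICS are the
concept numbers 311, 311 and 78 that appear in the expanded
string above.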

The  numeric  string  is  converted  to  early  operator 
reverse  Polish  notation.  Lukasiewicz,  a  Polish  logician, 
demonstrated  that  if  operators  were  written  after  their 
operands,  instead  of  between  them,  there  is  never  a  need 
for  parentheses  to  show  association  between  the  terms.  The 
resulting  reverse  Polish  notation  is  amenable  to  searching 
techniques.  For  example,  the  request 

(ECONOMICS ∨ RELIGION) ∧ (PER ≥ 1921)

has a reverse Polish representation as follows:

ECONOMICS, RELIGION, ∨, PER, 1921, ≥, ∧.

Hamblin  (1962)  gives  a  full  description  of  possible  Polish 
notations,  and  the  rules  by  which  a  string  is  transformed 
to  a  particular  Polish  notation.  Returning  to  our  example 
with  the  numeric  string,  its  reverse  Polish  notation  is 




31.000012, 311.000012, -2, 172.000030, 311.000030,
-2, -2, 61.000014, 78.000014, -2, -3, 1.000001,
0.165001, -9, -3, 1.000001, 0.190001, -6, -3.

The  corresponding  natural  language  request  is 

CHURCH, HISTORY, ∨, STATE, HISTORY, ∨, ∨, ECONOMICS,
MILITARY, ∨, ∧, PER, 1650, ≥, ∧, PER, 1900, ≤, ∧.
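
One way to carry out this conversion is a stack-based pass
over the coded string. The Python sketch below is not the
routine of Appendix B; it assumes that every operator is
binary, that all operators have equal precedence and are
applied left to right, and that parentheses alone override
that order. The unary ~ operator is not handled. The name
to_reverse_polish is hypothetical.

```python
OPEN, CLOSE = -1, -11  # codes of "(" and ")" from the text


def to_reverse_polish(coded):
    """Convert a coded infix request to reverse Polish order."""
    output, stack = [], []
    for item in coded:
        if item >= 0:                 # index term or year: an operand
            output.append(item)
        elif item == OPEN:
            stack.append(item)
        elif item == CLOSE:
            while stack and stack[-1] != OPEN:
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()               # discard the matching "("
        else:                         # a binary operator
            while stack and stack[-1] != OPEN:
                output.append(stack.pop())
            stack.append(item)
    while stack:
        if stack[-1] == OPEN:
            raise ValueError("unbalanced parentheses")
        output.append(stack.pop())
    return output


EXPANDED = [-1, -1, 31.000012, -2, 311.000012, -11, -2, -1, 172.000030,
            -2, 311.000030, -11, -11, -3, -1, 61.000014, -2, 78.000014,
            -11, -3, -1, 1.000001, -9, 0.165001, -11, -3, -1,
            1.000001, -6, 0.190001, -11]
print(to_reverse_polish(EXPANDED))
```

Applied to the expanded example string, this ordering rule
gives the same sequence as the reverse Polish vector quoted
above.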

At this point, the request has been edited, converted to
a numeric string, weighted and further transformed into
reverse Polish notation and is ready for the search. The
user is asked if a search should begin. If YES, the
matching of each entry in the file with the coded request vector
takes place. Document numbers satisfying the request are
saved in a vector WV, and the number of documents satisfying
the request is output to the user as COUNT = XX. Document
numbers, sorted on the number of index terms satisfying the
request and weights given to each index term, may then be
listed.

The information in the system consists of data files
and pointer vectors used to locate entries in the data files.
Three files, with their corresponding pointer vectors, form
the basis of the system. The first file, the thesaurus (see
Figure 5.1.4), contains the authorized index terms (LST).
The concept number of each term is directly related to the
position in the file that the term occupies; the first term
has concept number one, the second concept number two, etc.
Associated with LST is a vector, LSTPT, indicating the
starting position of each new index term, comparable to the
approach taken by PERMEM and PMPT. It is this file that is
consulted when editing index terms and determining concept
numbers.

[Diagram: concept numbers 1, 2, 3, 4, 5, ... aligned with
the pointer vector LSTPT, whose entries point into the term
list LST.]

Figure 5.1.4
Thesaurus

[Diagram: PMPT holds the starting addresses of the document
records in PERMEM (the values 93 and 143 are shown). PERMEM
is a single character string in which each record holds, in
order, the title, author name, journal name, index terms and
abstract, with special characters separating the parts.]

Figure 5.1.5
Peripheral Memory Simulation
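
The relationship between LST, LSTPT and the concept number can
be sketched as below. This Python fragment is illustrative:
the three terms are the first three shown in Figure 5.1.3
(POLITICS as concept number two is inferred from its position
there), and the function names term_of and concept_number, as
well as returning zero for an unauthorized term, are
assumptions.

```python
TERMS = ["PER", "POLITICS", "LANDS"]  # illustrative thesaurus entries

# LST: all terms run together; LSTPT: 1-based start of each term.
LST = "".join(TERMS)
LSTPT = []
position = 1
for term in TERMS:
    LSTPT.append(position)
    position += len(term)


def term_of(number):
    """Return the index term whose concept number is `number`."""
    start = LSTPT[number - 1]
    end = LSTPT[number] if number < len(LSTPT) else len(LST) + 1
    return LST[start - 1:end - 1]


def concept_number(term):
    """Return the concept number of `term`, or 0 if not authorized."""
    for number in range(1, len(LSTPT) + 1):
        if term_of(number) == term:
            return number
    return 0


print(LSTPT, concept_number("LANDS"), concept_number("TAXES"))
```

A term's ordinal position in LST is its concept number, so no
separate table of numbers needs to be kept.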

The second file, PERMEM (see Figure 5.1.5), consists
of all input data as they are entered by the user. All
alphabetic information which will subsequently be output in
response to the PRINT command is contained in this file.
For example, the title, author, journal, index terms and
abstract, all in natural language form, are stored on
simulated peripheral equipment; the address of the document
in this file serves as the document for processing purposes.
Special characters serve to separate each section of the
document. A pointer vector, PMPT, is also stored in the
file PERMEM. Each component corresponds to a document
number. The content of PMPT at this location is the
address of the starting character of the document
corresponding to the document number.
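
A minimal Python sketch of this arrangement follows. It is an
assumption-laden illustration: a single stand-in separator
character is used because the actual special characters are
not given in the text, and the function names store and fetch
are hypothetical.

```python
SEPARATOR = "|"  # stand-in for the special separating characters


def store(permem, pmpt, parts):
    """Append one document record; return (new PERMEM, document number)."""
    pmpt.append(len(permem) + 1)  # 1-based address of the new record
    return permem + SEPARATOR.join(parts) + SEPARATOR, len(pmpt)


def fetch(permem, pmpt, document_number):
    """Return the parts of a document, located through PMPT."""
    start = pmpt[document_number - 1] - 1
    if document_number < len(pmpt):
        end = pmpt[document_number] - 1
    else:
        end = len(permem)
    return permem[start:end].split(SEPARATOR)[:-1]


permem, pmpt = "", []
permem, first = store(permem, pmpt, ["TITLE 1", "AUTHOR 1", "JOURNAL 1"])
permem, second = store(permem, pmpt, ["TITLE 2", "AUTHOR 2"])
print(pmpt, fetch(permem, pmpt, second))
```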

The third file, a codified version of the natural
language file PERMEM, is the file directly utilized by the
computer; all operations of the computer are done on this
file. It consists of two subfiles: the main subfile, MEM,
and the years subfile, YMEM. MEM is an inverted file, with
every  index  term  (excluding  years),  every  author  and  every 
journal  in  the  system  occurring  as  an  entry.  Under  each 
of  these  entries  there  is  a  list  of  document  numbers  (or 
document  addresses),  each  number  identifying  a  document 
indexed  by  this  specific  index  term.  For  example,  if 
concept  number  76  was  used  as  an  index  term  for  documents 
173,  HI  and  683,  and  concept  77  was  used  as  an  index  term 
for  document  731,  part  of  the  main  file  would  appear  as  in 
Figure  5.1.6. 

The  main  subfile  consists  of  MEM  and  PT;  MEM  is 
variable  in  length.  Each  concept  has  two  positions  in 
which  to  record  document  numbers.  A  third  position  contains 
an address pointing to the row containing the next two
document numbers corresponding to the same concept number. If
there are no more documents, the third column contains the
special numeric code 999999. Hence, as more documents are
indexed  under  a  single  concept  number,  the  file  grows  in 
size.  However,  it  is  not  wasteful  of  space,  since  there  is 
at  most  one  vacancy  under  any  concept  number.  In  order  to 
retrieve  all  documents  indexed  by  a  specific  term,  the 
document  numbers  are  retrieved  by  chaining  through  the  file 
until  the  terminating  numeric  code,  999999,  is  encountered. 
As  the  file  grows  in  size,  and  as  experience  accumulates, 
the  file  can  be  revised  to  minimize  the  number  of  zero 




elements. Thus, the variable NC determines the number of
columns in MEM. Entry to the file is made by the pointer
vector, PT. The index term's concept number is used as a
subscript on PT; the contents of PT at this location is
the first row in MEM corresponding to this concept number
(see Figure 5.1.6).
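
The chaining scheme can be illustrated with a short Python sketch using the values of Figure 5.1.6. It is a sketch under stated assumptions rather than the SARA implementation: dictionaries stand in for the APL matrix and vector, the value 0 is taken to mark a vacant position (as the figure suggests), and the function names are hypothetical.

```python
END = 999999  # terminating numeric code stored in the third position

# Pointer vector: concept number -> first row of MEM for that concept.
PT = {76: 102, 77: 103}

# MEM rows: row number -> [document, document, next row or END].
# A 0 is assumed to mark a vacant document position.
MEM = {
    102: [173, 111, 182],
    103: [731, 0, END],
    182: [683, 0, END],
}


def documents_for(concept):
    """Chain through MEM and collect every document indexed by concept."""
    docs = []
    row = PT[concept]
    while row != END:
        first, second, next_row = MEM[row]
        docs.extend(d for d in (first, second) if d != 0)
        row = next_row
    return docs


def add_document(concept, doc):
    """Append doc to the chain, opening a new row only when the last is full."""
    row = PT[concept]
    while MEM[row][2] != END:
        row = MEM[row][2]
    for col in (0, 1):
        if MEM[row][col] == 0:
            MEM[row][col] = doc
            return
    new_row = max(MEM) + 1
    MEM[new_row] = [doc, 0, END]
    MEM[row][2] = new_row
```

Because a new row is opened only when the last row of a chain is full, at most one position is vacant under any concept number, which is the space property claimed above.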

The second subfile, YMEM, contains the period that
each document covers. Special treatment is given to the
problem of periods, or dates, dealt with by the document.
It would be impractical, but possible, to use every year
from zero to 1967 as a separate index term. Instead the
following method was used: each document which has years
as one or more of its index terms is entered in a table as
pictured in Figure 5.1.7. The first column contains the
document number, the second column the starting date, the
third either a date or a zero depending on whether the
document deals with two specific dates or a range of years,
and the fourth contains a date. By allowing a range of
years to be used by inserting a zero in the third column,
much repetition of dates is avoided.
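
A minimal Python sketch of how such a table might be searched follows. The two rows are taken from Figure 5.1.7; the interpretation of a non-zero third column as one of several specific dates, and the function names, are assumptions made for illustration only.

```python
# Rows: (document number, first year, year or code, last year).
YMEM = [
    (621, 1921, 1923, 1930),  # specific dates (assumed reading of the row)
    (1161, 1860, 0, 1902),    # zero in the third column: the range 1860-1902
]


def covers(row, year):
    """Return True if the document described by row deals with year."""
    _, first, code, last = row
    if code == 0:
        # A zero code means every year from first to last is covered.
        return first <= year <= last
    # Otherwise only the dates actually listed are covered.
    return year in (first, code, last)


def documents_for_year(year):
    """Return the numbers of all documents dealing with the given year."""
    return [row[0] for row in YMEM if covers(row, year)]
```

A single row with a zero code thus stands in for what would otherwise be one index entry per year of the range.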

Presently, no files are on peripheral devices such
as disc or magnetic tape. All are simulated in memory by
means of matrices and vectors, and remain in memory throughout
processing. This is due to the programming language's lack
of input and output commands, which, it is hoped, will be
rectified in the near future.




Concept Number:     75     76     77     78

PT:                 85    102    103    180


MEM:

Row Number
   102             173    111    182
   103             731      0    999999
   182             683      0    999999

Figure 5.1.6
Main Subfile


YMEM:

Document    First     Year or     Last
Number      Year      Code        Year

   621      1921       1923       1930
  1161      1860          0       1902
    31      1850     999999     999999

Figure 5.1.7
Years Subfile


CHAPTER  VI 


COMPARISON  AND  EVALUATION  OF  THE  SARA  SYSTEM 

6.0 Introduction

The  SARA  system  has  been  described  in  detail,  and 
illustrations  of  its  use  appear  in  Appendix  C.  This  chapter 
will suggest improvements to the programming language,
compare SARA with other on-line systems and discuss the strengths
and  weaknesses  of  the  SARA  system. 

6.1 APL As a General Programming Language

APL is the only programming language used to develop
the mechanized portion of the SARA system. As the application
encompasses not only the handling of matrices (in
peripheral equipment simulation) but also much general data
processing, e.g., character handling and input and output
considerations, an opinion could be formed as to the
suitability of the language as a general processing language.

The language is extremely powerful when dealing with
arrays and performing most mathematical operations. For
example, to test if the sum of components of a matrix A
is equal to a scalar value, X, one need only type

     Y ← X = +/+/A

Y would then contain either 1 (true) or 0 (false).
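
For readers unfamiliar with APL, the same test can be written out in Python as a rough equivalent. This is offered only as a comparison; the function name and sample matrix are hypothetical.

```python
def sum_equals(matrix, x):
    """Return 1 if the sum of all components of matrix equals x, else 0."""
    total = sum(sum(row) for row in matrix)  # plays the role of +/+/A
    return int(total == x)


A = [[1, 2, 3], [4, 5, 6]]
Y = sum_equals(A, 21)
```

The explicit iteration over rows that Python requires is exactly what the two reductions express in a single APL line.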




However, other aspects of the language could be
improved. This investigation indicated several ways in
which the language could be modified so that programs for
information storage and retrieval could have been written
more conveniently. These are as follows:

a) When dealing with functions, the user does not
have the power to declare any function parameter as local
or global, i.e., call-by-value or call-by-name. All
variables are assumed local. Hence, if more than one variable
containing a large number of elements is to be operated upon
in exactly the same manner, separate functions must be
written to handle each variable, and no function can
have the variable as a parameter. This arises since a
temporary copy is made of a variable appearing as a function
parameter while the variable is being operated on in the
function. If the variable contains enough elements
to overflow memory capacity when duplicated, it
cannot be used as a function parameter. Such variables
often occur in information retrieval programs, e.g.,
thesaurus lists, or lists of document numbers. By giving
the user the power to declare a variable in a function
header as global (and thus avoid duplication), this problem
could be avoided.

b) At present, a function is limited to two
parameters which may be scalars or arrays of any dimension.




It  would  be  convenient  if  functions  could  be  extended  to 
include  a  larger  number  of  parameters. 

c)  List  processing  techniques  have  many  applications, 
particularly  in  information  retrieval.  Commands  which  would 
enable  data  to  be  processed  as  lists  would  also  be  useful. 

d)  Looping  is  necessary  for  dealing  with  iterative 
procedures  in  functions.  At  present,  no  convenient  looping 
command  comparable  to  the  DO-statement  in  Fortran  exists  for 
APL.  An  instruction  which  would  enable  the  starting  value, 
increment,  ending  value,  variable  incremented  and  end  of  the 
loop  to  be  defined  would  be  useful. 

e) At present, APL is only available as an interpreter.
No object code is generated and hence much work must be
duplicated in interpreting each statement each time through
a loop. If programs could be debugged in interpretive mode,
and subsequently compiled into efficient object code for
later production runs, the language would be more efficient.

f)  One  of  the  largest  bottlenecks  in  the  language  at 
present  is  its  inability  to  communicate  with  peripheral 
devices  other  than  the  user's  terminal.  There  are  no  input 
or  output  commands  to  peripheral  devices.  If  one  could 
direct  the  system,  from  the  terminal,  to  input  from  a  large 
storage  capacity  device,  or  to  output  data  to  peripheral 
devices  such  as  printer  or  magnetic  tape,  the  language  would 
become  very  attractive  to  the  general  computer  user  and  also 



the  information  retrieval  programmer.  Implicit  in  this 
proposal  is  the  ability  to  format  the  data  as  desired  by 
the  user. 

On  the  whole,  APL  is  an  improvement  over  existing 
languages  when  used  for  communicating  with  the  computer 
directly. It is very powerful when dealing with arrays
or  arithmetical  applications.  This  power,  unfortunately, 
does  not  extend  to  features  required  to  make  it  a  more 
general  purpose  language. 

6.2 SARA and Other On-Line Systems

Other  on-line  information  systems  have  been  developed. 
SARA  is  compared  and  contrasted  to  four  systems  discussed 
previously:  CONVERSE,  TIP,  SMART  and  BOLD. 

The  CONVERSE  system  does  not  give  the  user  much  control 
over  the  formulation  of  requests  or  output  format.  For 
example, the type of output given to the user after completing
a search is dependent on the number of matches made;
also,  any  expansion  of  requests  must  be  done  manually.  The 
SARA  system  allows  for  more  options  and  control  than  the 
CONVERSE  system,  such  as  automatic  expansion  upon  request. 

The  Technical  Information  Project  uses  a  computer  with 
capabilities  comparable  to  the  computer  used  by  the  SARA 
system; it has remote consoles connected directly to a
time-shared computer. However, the approach taken by TIP is




unlike  that  taken  by  SARA.  Where  SARA  requires  the  manual 
indexing  of  documents  prior  to  storage  in  the  computer,  TIP 
uses  only  the  title,  author,  source  journal  and  bibliographic 
data  of  each  document.  Little  professional  manual  effort  is 
required  to  prepare  documents  for  the  TIP  system.  The  search 
strategies  then  vary  according  to  the  differences  in  the  type 
of  data  stored.  TIP  users  chain  from  one  document  to  the 
next  via  the  bibliographic  data,  while  SARA  users  retrieve 
documents  by  matching  index  terms  in  the  request  and  documents. 

The  SMART  system,  in  part,  resembles  SARA.  In  general, 
the  SMART  system  has  greater  capabilities  than  the  SARA 
system.  It  accepts  the  full  text  of  a  document  as  input, 
and  automatically  indexes  the  document.  Searches  are  then 
made  on  these  indexed  documents;  the  user  retains  a  great 
deal  of  control  over  the  expansion  of  the  request  and  output 
format.  SARA  requires  the  manual  indexing  of  the  documents, 
but  provides  a  search  procedure  resembling  that  of  the 
SMART  system.  The  user  has  options  available;  he  retains 
control  over  the  formulation  of  the  request,  the  subsequent 
search and the output.

The  BOLD  system  requires  that  documents  be  manually 
indexed  prior  to  their  storage,  and  requires  a  list  of 
authorized  index  terms,  similar  to  SARA.  However,  in  order 
to  search  for  documents  in  the  BOLD  system,  the  user  chains 
through  the  hierarchy  of  authorized  index  terms.  The  terms 




at  the  end  of  the  tree  (or  any  level  prior  to  the  end)  are 
used  as  index  terms,  and  the  user  initiates  a  search  with 
these  terms.  Matches  are  output.  SARA,  on  the  other  hand, 
does  not  require  the  user  to  follow  through  the  hierarchy 
of  terms;  any  index  term  connected  with  any  other  by  Boolean 
operators  is  valid. 
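
The difference from hierarchy chaining can be illustrated by a small Python sketch of a Boolean request evaluated directly against lists of document numbers. The document numbers, the helper name and the use of Python sets are assumptions for illustration; they are not taken from the SARA files.

```python
# Hypothetical posting lists: index term -> set of document numbers.
POSTINGS = {
    "POL": {3, 7, 12},
    "ELECTIONS": {7, 12, 20},
    "REL": {5, 12},
}


def term(name):
    """Return the documents indexed by a term (empty set if unknown)."""
    return POSTINGS.get(name, set())


# Request: (POL AND ELECTIONS) OR REL
result = (term("POL") & term("ELECTIONS")) | term("REL")

# Request: POL AND NOT REL
excluded = term("POL") - term("REL")
```

Any term may be combined with any other in this way; no path through a hierarchy has to be followed before the search can begin.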

6.3 Strengths and Weaknesses of the SARA System

The mechanized portion of the SARA system has many
favourable features. Since it is programmed for a
time-shared computer, the availability of the system is restricted
only by availability of the computer and typewriter consoles.
The system requires little effort on the part of the user to
learn its use. Response to queries is almost instantaneous.
If an error is detected in a request, e.g., a misspelled index
term, the user is notified immediately and the request can be
corrected. If a request does not retrieve a sufficient number
of appropriate documents, it may be expanded to increase
the number of documents retrieved. Full English words are
used to communicate with SARA; no codes need be remembered.
For the user familiar with the system, abbreviations of the
full words may be used, e.g., YES may be abbreviated to Y.
The SARA system has options available, and is easy to use.
However, there is room for improvement.




Only a portion of the information storage and retrieval
cycle is automated by SARA; manual indexing and abstracting
are still required. The SARA system could be expanded to
include the mechanization of these processes. The present
system is limited to upper case characters only. By modifying
some routines slightly, and by using an upper and lower
case typeball on the typewriter, this limitation can be
overcome. As the number of entries in the main file increases,
the response time to the search request will increase. Also,
with the incorporation of input and output commands, the
system could handle an increased number of documents. However,
the capabilities of SARA, like the capabilities of other
mechanized information systems such as SMART, TIP, CONVERSE
or MEDLARS, are still relatively elementary when compared to
human capabilities. Most mechanized systems rely on Boolean
operators only; human capabilities, such as recognizing
near-synonymous words, are not incorporated into the mechanized
systems. Further research could be undertaken to improve
and extend the SARA system. It is hoped that this project
will be continued.




BIBLIOGRAPHY 


American Bibliographic Center, 1965. Guidance Booklet,
America: History and Life, Clio Press, Santa
Barbara, Calif.

Baxendale,  P.B.,  1958.  "Machine-made  index  for  technical 
literature",  IBM  Jour.  Res.  Dev.,  2:354-361. 

Becker,  J.  and  R.M.  Hayes,  1963.  Information  Storage  and 
Retrieval:  Tools,  Elements,  Theories,  J.  Wiley  and 
Sons,  New  York. 

Bledsoe, W.W. and I. Browning, 1959. "Pattern recognition
and reading by machine", 1959 Proc. of the East.
Joint Comp. Conf., 16:225-232. Also in: L. Uhr,
(1966).

Borko, H. and M. Bernick, 1963. "Automatic document
classification", Jour. Assoc. Comp. Mach., 10:151-162.

Borko, H. and M. Bernick, 1964. "Automatic document
classification: Part II, additional experiments",
Jour. Assoc. Comp. Mach., 11:138-151.

Bourne, C.P., 1963. Methods of Information Handling,
J. Wiley and Sons, New York.




Brain, A.E., G.E. Forsen, N.J. Nilsson and C.A. Rosen,
1962. "Learning machines", International Science
and Technology, 658-669.

Brandhorst, W.T. and P.F. Eckert, 1966. Guide to
Processing, Storage and Retrieval of Bibliographic
Information at the NASA Scientific and Technical
Information Facility, Documentation Incorporated,
College Park, Md.

Brown, S.C., 1966. "A bibliographic search by computer",
Phys. Today, 19:59-64.

Burnaugh, H.P., 1966. The BOLD (Bibliographic On-Line
Display) System, System Development Corp., Santa
Monica, Calif.

Calingaert, P., 1968. Introduction to APL, Science Research
Associates, Inc., Chicago.

Damereau, F.J., 1965. "An experiment in automatic indexing",
Amer. Doc., 16:283-289.

Datatrol Corporation, 1965. Final Report on Phase I -
Systems Design and Action Plan for the Pesticides
Information Center, National Agricultural Library,
Washington, D.C.




Doyle, L., 1965. "Is automatic classification a reasonable
application of statistical analysis to text?",
Jour. Assoc. Comp. Mach., 12:473-489.

Drew, D.L., R.K. Summit, R.I. Tanada and R.B. Whitely,
1966. "An on-line technical library reference
retrieval system", Amer. Doc., 17:3-7.

Edmundson, G.P. and R.E. Wyllys, 1961. "Automatic
abstracting and indexing - a survey and recommendations",
Comm. Assoc. Comp. Mach., 4:226-234.

Falkoff, A.D. and K.E. Iverson, 1966. APL\360,
International Business Machines Corp., Yorktown
Heights, New York.

Feigenbaum, E. and J. Feldman, 1963. Computers and Thought,
McGraw-Hill, New York.

General Electric Company, 1963. The MEDLARS Story at
the National Library of Medicine, U.S. Department
of Health, Education and Welfare Public Health
Service, Washington, D.C.

Grimsdale, R.L., F.H. Sumner, C.J. Tunis and T. Kilburn,
1959. "A system for the automatic recognition of
patterns", Proc. Inst. of Elec. Eng., 106B:210-221.
Also in: L. Uhr, (1966).




Gyr, J.W., J.S. Brown, R. Willey and A. Zivian, 1966.
"Computer simulation and psychological theories
of perception", Psych. Bull., 65:174-192.

Hamblin, C.L., 1962. "Translation to and from Polish
notation", Comp. Jour., 5:210-213.

Iverson, K.E., 1962. A Programming Language, J. Wiley
and Sons, New York.

Kent, A., 1962. Textbook of Mechanized Information
Retrieval, Interscience Publishers, New York.

Kessler, M.M., 1963a. "An experimental study of
bibliographic coupling between technical papers",
IEEE Trans. PTGIT, IT-9:49-50.

Kessler, M.M., 1963b. "Bibliographic coupling between
scientific papers", Amer. Doc., 14:10-25.

Kessler, M.M., 1965a. "The MIT technical information
project", Phys. Today, 18:28-36.

Kessler, M.M., 1965b. "Comparison of results of
bibliographic coupling and analytic subject indexing",
Amer. Doc., 16:223-233.




Licklider,  J.C.R.,  1965.  Libraries  of  the  Future,  The 
M.I.T.  Press,  Cambridge,  Mass. 

Luhn,  H.P.,  1958.  "The  automatic  creation  of  literature 
abstracts",  IBM  Jour.  Res.  Dev.,  2:159-165. 

Marden,  E.C.,  1965.  HAYSTAQ,  A  Mechanized  System  For 
Searching  Chemical  Information,  National  Bureau 
of  Standards  Technical  Note  264,  Washington,  D.C. 

Maron, M.E., 1961. "Automatic indexing: an experimental
enquiry", Jour. Assoc. Comp. Mach., 8:407-417.

Minsky, M., 1961. "Steps toward artificial intelligence",
Proc. IRE, 49:8-13. Also in: E. Feigenbaum and
J. Feldman, (1963).

Nilsson, N.J., 1965. Learning Machines, McGraw-Hill,
New York.

Prather, R.C., and L.M. Uhr, 1964. "Discovery and learning
techniques for pattern recognition", Proc. Assoc.
Comp. Mach. 19th National Conf., Philadelphia, Pa.,
P-64:D2.2-1 - D2.2-10.

Roberts, L.G., 1960. "Pattern recognition with an adaptive
framework", IRE, 1960 International Convention
Record, 2:66-70. Also in: L. Uhr, (1966).




Rosenblatt, F., 1960. "Perceptron simulation experiments",
Proc. IRE, 48:301-309.

Salton, G., 1963. "Some hierarchical models for automatic
document retrieval", Amer. Doc., 14:213-222.

Salton, G., 1964. "A document retrieval system for
man-machine interaction", Proc. Assoc. Comp. Mach.
19th National Conf., P-64:L2.3-1 - L2.3-20.

Salton, G., 1965. "The evaluation of automatic retrieval
procedures - selected test results using the SMART
system", Amer. Doc., 16:209-222.

Salton, G. and M.E. Lesk, 1965. "The SMART automatic
document retrieval system - an illustration",
Comm. Assoc. Comp. Mach., 8:391-398.

"Schizophrenia reading made easy", 1966. Journal of
Data Management, October.

Selfridge, O.G. and U. Neisser, 1960. "Pattern recognition
by machine", Scientific American, 203:60-68. Also
in: E. Feigenbaum and J. Feldman, (1963).

Silva, G. and C.J. Bellamy, 1965. Language Data Processing
Concordance Generation, Monash University Computer
Center Publication R. 2, Melbourne, Australia.




Swanson, D.R., 1964. "Design requirements for a future
library", Libraries and Automation, Library of
Congress, Washington, D.C.

Tasman, P., 1957. "Literary data processing", IBM Jour.
Res. Dev., 1:249-256.

Uhr, L., 1963. "Pattern recognition computers as models
for form perception", Psych. Bull., 60:40-73.

Uhr, L., 1966. Pattern Recognition, J. Wiley and Sons,
New York.

Vickery, B.C., 1965. On Retrieval System Theory, (Second
Edition), Butterworths, London.

Walston, C.E., 1965. "Information retrieval", Advances
in Computers, 6:1-30.




APPENDIX  A 


SELECTION  AND  INDEXING  OF  DOCUMENTS 

The documents used to test the SARA system consist
of manually indexed papers appearing in the two history
journals Agricultural History and Saskatchewan History.
The papers were indexed by a Graduate Research Assistant
in the Department of History at the University of Alberta
in collaboration with the author. The history participant
chose all articles to be indexed, all index terms and the
relationship between index terms. The author's role was
primarily to maintain sufficient control over choice of
index terms and the indexing so that they would be compatible
with the computer. Much assistance was willingly given by
everyone concerned. No extensive user studies have been
carried out on the SARA system. Figure A.1 depicts, in
block diagram form, the general flow of the statement and
solution of the problem.


Figure A.1
Statement and Solution of the Problem


The words "cue word", "key word", "descriptor" and
"index term" often appear in the literature to describe the
same concept, i.e., a word or short phrase which describes
part or all the content of a document. In this thesis, the
term "index term" is used. All decisions concerning the
selection of index terms for testing the SARA system rested
on one person. In order to check the indexing, opinions of
more people with backgrounds in history should be sought. A
publication of the American Bibliographic Center (1965) was
consulted, and an adaptation of the index terms in this
publication was used. Small changes in the format of the
index terms were incorporated, such as replacing blanks by
hyphens, to aid in computer manipulation. Two levels of
indexing were used. The first level indicated the general
area of interest, such as politics (POL) or religion (REL).
The second level limits these broad fields to more narrow
concepts, e.g., REL is limited by index terms such as
"Roman-Catholic" and "Missions". The two levels of indexing,
it was felt, made indexing easier since the general topic or
topics of the article are required prior to those described
by specific index terms. A coding sheet was designed by the
author (see Figure A.2) onto which documents were indexed
directly. Instructions for the correct use of this coding
sheet must be provided for the indexer. Cards can be punched
from this coding sheet, or the sheets may be used for
direct input to the computer through the console.

Figure A.2
Coding Sheet
(University of Alberta, Department of History, Indexing Document)

After choosing a sample of articles to be indexed, the history
participant chose index terms for the articles. If the need
arose for a new index term, the term was added to the list
of authorized index terms. After indexing several articles,
the list was reviewed and synonymous words were combined.
It was observed that the number of index terms added to the
list decreased as more documents were indexed. About 100
articles were indexed; all but two appeared in the journals
Agricultural History and Saskatchewan History. This sample
averaged 6.7 index terms per document from a total of 250
index terms. The sample size used to test the system was
relatively small. Of the 100 documents, 44 were selected
to test the system. Not all of them could be used due to
the lack of memory available to the SARA system. The test
documents had 120 different index terms and 32 different
authors from two journals; 59 of these terms were used in
the manually generated hierarchy of term relations. Not all
terms appear in the hierarchy, since some
terms are unrelated to the other terms. Following is a
list of the terms used.

Cue  words  (referred  to  as  general  terms): 

ALM  (archives,  libraries,  museums) 

BIO  (biographic  articles) 

CUL  (cultural  life) 

ECO  (economic  life) 

EDU  (education) 




FAM  (family  and  genealogy) 

GEO  (geography) 

IND  (Indians) 

IRL  (international  relations  and  law) 
LAN  (land  and  agriculture) 

LIT  (literature) 

MED  (medicine  and  public  health) 

MET  (methodology,  research  methods) 
PER  (period) 

POL  (politics,  government) 

POP  (population,  immigration) 

REL  (religions  and  churches) 

SOC  (social  history,  structure) 

URB  (urbanization,  communities) 


Concepts  (referred  to  as  specific  terms): 

ADMINISTRATION 

AGHIST 

AHENAKEW-CE 

ALBERTA 

ARCOLA 

BALCARRES 

BENNETT-TW 

BETTER-FARMING-TRAINS 


BOCKING-DH 




BROWN-GW 

BRUNO 

BUCK-RM 

CATHOLIC-SETTLEMENT-SOCIETY 

CHURCH-GC 

CHURCH-OF-ENGLAND 

CLERK-OF-LEGISLATIVE-ASSEMBLY 

CLIMATE 

CLOTHING 

COLLEGE-OF-AGRICULTURE

COMARADES-OF-EQUITY 

CONSTITUTION 

CRIME 

CULTIVATION 

CUSTOMS 

DAHLMAN-A 

DAILY-PHOENIX 

DAIRY 

DIETARY 

DIRECT-LEGISLATION-LEAGUE 

DOCTOR 

DOERFLER-B 

DOMINION-LAND-SURVEY 

EAGER-E 


EAGLE-LAKE 



EASTEND 

ELECTIONS 

ENTERTAINMENT 

EXPLORATION 

EXPLORERS 

FEDERAL-GOVERNMENT

FORT-LIVINGSTONE 

FRENCH-CANADIAN

GRAVEL-LP 

GRAVELBOURG 

GRAYTOWN 

GREENE-DL 

GUERNSEY-GF 

HAMILTON-ZM 

HANSON-SD 

HAULTAIN-FWG 

IMMIGRATION 

INDEPENDENT 

INDIAN-DAY-SCHOOL 

INDIAN-SCHOOL

INTERNATIONAL-BOUNDARY 

JOHNSON-G 

KIRK-LE 

KLAUS-JF 


KOESTER-CB 




LANG 

LEGISLATIVE-ASSEMBLY 

LIBRARIES 

LITTLE-PINE 

LOCAL-GOVERNMENT 

MACDONALD-C 

MACKAY-JA 

MACLEAN-H 

MANITOBAN 

MATHESON-FAMILY 

MILLER-AR 

MISSIONARIES 

MISSIONS 

MORGAN-EC 

MOTHERWELL-WR 

MURRAY-JE 

NATURAL-EVENTS 

NISBET-J 

NO-PARTY-LEAGUE 

NORTH-QUAPPELLE 

NORTHWEST-MOUNTED-POLICE 

NORTHWEST-TERRITORIES

OLIVER-EH 

ONION-LAKE 


ORGANIZATIONS 




PALLISER-J 

PARLIAMENTARY-SYSTEM

PARTIES 

PATRONAGE 

PEOPLES-POLITICAL-ASSOCIATION

PIONEER-LIFE 

PLACE-NAMES 

POLICE 

POLICY 

POLITICAL-THEORY 

POLITICIANS 

PRAIRIE-FIRE 

PRESBYTERIAN 

PRINCE-ALBERT 

PUBLIC-UTILITIES 

PUBLIC-WORKS 

QUAPPELLE-VIDETTE 

RECREATION 

REGINA 

REID-AN 

ROE-FG 

ROMAN-CATHOLIC 

ROUMANIANS 

SASKATCHEWAN 

SASKATCHEWAN-HEARLD 



SASKATOON 

SASKHIST 

SCOT-W 

SPAFFORD-DS 

SPORTS 

SPRY-IM 

SPY-HILL-MUNICIPALITY 

ST-STEPHEN-RM 

STANLEY-MISSION 

STEGNER-W 

STEWART-EC 

SURVEY 

SUTHERLAND-W 

SWAN-RIVER-BARRACKS 

TERRITORIAL-GRAIN-GROWERS-ASSOCIATION

THE-LEADER 

THOMPSON-WP 

TRAVEL 

TULLOCH-C 

TURNER-AR 

UNIVERSITIES 

UNIVERSITY-OF-SASKATCHEWAN 

URBAN 

WEBSTER-EE 

WILLOW-BRANCH 


WOOD-MOUNTAIN 




The documents used to test the system appeared in two
journals: Saskatchewan History, volumes 9 to 19 inclusive,
and Agricultural History, volume 28.

The relationships between the terms were set up manually
by the history participant. An adaptation of the
hierarchy published by the American Bibliographic Center
(1965) was used. Figure A.3 illustrates the terms used
by the SARA system. This hierarchy is relatively simple
because of the small sample of documents and index terms
used. It is characteristic of the social sciences and the
humanities, such as history, that the hierarchy becomes
much more complicated as the number of index terms increases.
Hence, more research with a larger sample is required
on the approach taken in this thesis.

If an index term is to be added to the system, two
steps must be taken. First, the relationships between the term
and all other index terms must be determined, and the term
must be manually entered in the hierarchy if applicable.
Second, the term must be placed in the thesaurus and in the
hierarchical representation, TREE, within the computer.
At present, no routines exist which allow easy modification
of the data within the computer. However, such routines
could be coded. Addition of documents to the system is
easily handled by using the STORE subsystem.
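
The two steps for adding an index term can be sketched as follows. This is only an illustration in Python, not part of SARA: the names `thesaurus`, `broader`, `add_index_term` and `narrower_terms` are hypothetical, the dictionary layout is an assumption (SARA holds the hierarchy in the APL array TREE), and the relationship shown between the two terms is chosen purely as an example.

```python
# Hypothetical in-memory model; SARA itself keeps this data in APL arrays.
thesaurus = {"UNIVERSITIES"}      # set of valid index terms
broader = {}                      # term -> set of more general terms

def add_index_term(term, general_terms=()):
    """Step 1: the caller supplies the manually determined relationships.
    Step 2: the term is placed in the thesaurus and in the hierarchy."""
    unknown = set(general_terms) - thesaurus
    if unknown:
        raise ValueError(f"unknown general terms: {sorted(unknown)}")
    thesaurus.add(term)
    broader.setdefault(term, set()).update(general_terms)

def narrower_terms(term):
    """Return the terms entered directly below `term` in the hierarchy."""
    return sorted(t for t, parents in broader.items() if term in parents)

# Illustrative relationship only; the real hierarchy is given in Figure A.3.
add_index_term("UNIVERSITY-OF-SASKATCHEWAN", ["UNIVERSITIES"])
```

A routine of this kind would remove the need to edit the stored data by hand, which is the gap noted above.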

The selection and indexing portion of the SARA system
is in the development stages; it is unpolished and, to a
large extent, untested. More research into this portion
of the system would be beneficial. For example, extensive
testing by members of the Department of History may
indicate either areas of improvement in the choice of index
terms and methods of indexing the documents, or a weakness
in the hierarchical structure.


Figure A.3. Hierarchy of index terms used by the SARA system, headed by the term HISTORY (tree diagram not reproduced).


APPENDIX  B 


BLOCK  DIAGRAMS  AND  LISTINGS  OF  ROUTINES  IN  SARA 


Block diagrams are given for the routines SARACNL, PRINT, STORE,
FIND, CODER, EXPAND, RPOL, OPDN and SCAN (diagrams not reproduced).
The labels surviving from the FIND diagram show the sequence:
output START SCAN?; input YES or NO; on NO, output RE-ENTER REQUEST;
on YES, call SCAN and output COUNT = XX.


APPENDIX  B,  CONTINUED 


PROGRAM  LISTINGS 


      ∇SARACNL[⎕]∇
    ∇ SARACNL
[1]   SARA
    ∇

      ∇SARA[⎕]∇
    ∇ SARA;T
[1]   'FULL EXPLANATION? YES OR NO'
[2]   →(BEND T←INP ⍞)/0
[3]   →(T[1]='N')/L1
[4]   XPLAN 1
[5]   L1:'GO'
[6]   SRANGE←⍳⍴⍴T
[7]   T←7⍴(T←INP ⍞),7⍴' '
[8]   →(∧/T[⍳4]='FIND')/FINDL
[9]   →(∧/T[⍳5]='PRINT')/PRINTL
[10]  →(∧/T[⍳5]='STORE')/STOREL
[11]  →(∧/T[⍳3]='END')/FINISHL
[12]  'INCORRECT COMMAND'
[13]  →L1
[14]  FINDL:FIND T[5]='N'
[15]  →L1
[16]  PRINTL:PRINT T[6]='N'
[17]  →L1
[18]  STOREL:STORE T[6]='N'
[19]  →L1
[20]  FINISHL:'ALL DONE'
    ∇
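
The command selection performed by SARA can be sketched in Python as follows. This is a minimal sketch, not a translation: the function name `dispatch` is hypothetical, and it assumes, as the listing suggests, that blanks are removed from the input and that a command is recognized by its leading characters.

```python
ROUTINES = ("FIND", "PRINT", "STORE", "END")

def dispatch(command):
    """Return the routine named at the start of `command`, or None.

    Mirrors the leading-character comparison in the SARA listing;
    None corresponds to the 'INCORRECT COMMAND' response.
    """
    text = command.replace(" ", "")   # assumed: blanks are ignored
    for name in ROUTINES:
        if text.startswith(name):
            return name
    return None
```

A caller would loop on `dispatch` after typing GO, leaving the loop when `"END"` is returned.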




V  (7002?/?  cm  V 

V  VOCODER  X ;T iWT IS II iU ;W ;M IMP  IP 
Cl]  £-*-7-<-iO 

[2]  CODEROQi' EQUAL  WEIGHTS ?' 

[3]  -►(  BEND  Q+INPH})/  0 

[4]  $+-8iV8=e[l] 

[5]  C0DER13 iT+' ( 8  ,  (3MU)  ,  8 ) 8 

[6]  9)/r)9i0 

C  7 ]  +((+/T=' ('  )=  +  /T='  )') /CODER 14 

[8]  8 UNMATCHED  PARENS 8 

[9]  +C0DER13 

[10]  CODERm%M+S\(S+v/To  «  =  ’  (va~< <*=£>)  8  )/T 

[11]  MP+(~S)\(~S)/T 

[12]  I-*-  0 

[13]  V+-  ( ~(  ”l  SH  S)aS+M='  8)/M 

[14]  7«-(  +/ (  7°  o  =  8  (va~<<^  =  >>)'  )x((p7)si2)p-  1+i12),i0 

[15]  -*(  0*p7)/Ctf££i?99 

[16]  7«-09i0 

[17]  CODER  9  9  :  U+MP 

[18]  CODER02°,-*(0  =  pUl/0 

[19]  -*(0  =  p£M~(p£Oa  1  +(£/*’  8)il)/i/)/0 

[20]  J<-J+l 

[21]  W+(  (pi/)a“l  +  £/i  8  ’)/£/ 

[22]  £M~(p£/)a£/i  8  ')/[/ 

[23]  +(~A/(  4pf/  9  8  8  )e  8  012345678  9  8  )/CODER05 

[24]  I+NUM  W 

[25]  -*-(  210  0<r)/Ctf££i?15 

[26]  -*»(  O^J^JtIOOOO  )/CODER12 

[27]  CODER 0  5;  +  (  l*I«-4£/Ttf  W) /CODER 0  7 

[28]  CODER1 5  s ’ ILLEGAL  TERM  -  89V 

[29]  +CODER 00 

[30]  CODER07 i  +  ( (~§)va/^€ 8 0123456789  8  ) /CODER  0  21 

[31]  CODER20%' WEIGHT?  -  \W 

[32]  WT+((WT*'  8  )  /  )  ,  i  0 

[33]  -*(  A/1 1*£«-8  012345  6  78  9  8  i  WT) /CODER08 

[34]  8+7£  tfi/Aftf/MLS  CWLJ8 

[35]  +CODER20 

[36]  CODER08  ?-*•(  0 . 0001>WT*-(NUM  WT  )  xl£~6  ) /CODER12 

[37]  ’<99  OtfLJ8 

[38]  +CODER 20 

[39]  CODERO  21  iWT*-lE~6 

[40]  CODER12  1VIV1OI+-Y+WT 

[41 ]  +CODER02 


V 




VEXPANDlUlV 

V  X*-P  EXPAND  YiMiTiViR 


[1] 

X+\Q 

[2] 

EXPAND03i+(0=pY)/EXPAND05 

[3] 

-*■(  (  l  +  py)=iM  Y>1)\1) /EXPANDOb 

[4] 

-►(  0=r-*-fP2’[y[P]-^l  |  YlRll) /EXPAND  14 

[5] 

-*■(  (P=l )  ,  (P=2  )  ,P  =  3  J/ETP^PPll  ,EXPAND12 

.EXPAND  21 

[6] 

PXP4PP21  s  MOPPET  2  ;  ;P] 

[7] 

EXPAND13  s-K  v/Af )  /EXPAND02 

[8] 

EXPANDS  zX+X ,  J[  iP] 

[9] 

+ETP4PP06 

[10] 

EXPAND02  J[  i“l+P] 

[11] 

->(  l  =  +  /M)/PZPi4PD15 

[12] 

U+-W+  (  v /Tpt °  .  =Af/  ipAO/ip^Pr 

[13] 

+PZP4PP16 

[14] 

EXPAND!  5  s  U+-W+¥PT \M/  i  pAf 

[15] 

PJPAPP16  s  7-«-  (  (  ~27 )  x  ”  2  )  +  (  P-«-  (  2  x  p  £/  9  i0)p  0 

1  )\u 

[16] 

l,J[P],7,”ll 

[17] 

EXPANDOQz  y«-(~(pJ)aP)/y 

[18] 

+EXP AND  03 

[19] 

PZP.4PP  1 1  s M+TREE [  1  i T  %  ] 

[20] 

■^EXPAND  13 

[21] 

EXPAND12  zM^REEll  i  ;P] 

[22] 

+PyPylPP13 

[23] 

EXPAND05  ? Z-«-X 9  J 

V 


7  RPOL CD3V 

V  S+RPOL  I ; PRI ; OPS ; T %  TP 

[1]  PRI+  001121111111 

[2]  OPS+-~ l+il2 

[3]  £-<-1 p  0 

[4]  RPOL1 ;+( 0=pl) /RPOLS 
l 5]  T+Illl 

[  6  ]  PUREST  I 

[7]  ->(  a/T*OPS  )  /RPOL5 

[8]  TP*-(T  =  OPS  )  /PRI 

[9]  RPOL  2  :-►(  (  (  P  =  ~4  )  a£[  1  3  ="4  )  v  ( T  =  ~l )  vTP>  ( 5[  1  ]  =OPS  ) /PPJ  )  /  ( 
PP£L3  ,PP0L4  )[1+T£  11] 

[10]  S+~l  SH  S 

[11]  +RPOL 2 

[12]  RPOLZiS+T ,S 

[13]  +RPOL 1 

[14]  RPOL 4  s SUREST  S 

[15]  +RPOL1 

[16]  RPOL 5 : S+S , T 

[17]  +RPOL1 

[18]  RPOL  6  :  SUREST  (  -  (((<S'=0)/ip5)-l)  )SH  S 


V 
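
The name RPOL, and its use in FIND immediately before SCAN, suggest that it rearranges the Boolean request into reverse Polish (postfix) order. The sketch below shows one standard way of doing this, the operator-stack method, in Python. It is not a transcription of RPOL: the token spellings `AND`, `OR`, `NOT`, the precedence values, and the name `to_postfix` are all assumptions made for illustration.

```python
# Assumed precedences; the values used by RPOL are not legible in the listing.
PRECEDENCE = {"NOT": 3, "AND": 2, "OR": 1}

def to_postfix(tokens):
    """Convert an infix Boolean request (a list of tokens) to postfix order."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unmatched parens")
            stack.pop()                      # discard the "("
        elif tok in PRECEDENCE:
            # Pop operators that bind at least as tightly; NOT is a prefix
            # operator and is treated as right-associative.
            while (stack and stack[-1] != "(" and
                   (PRECEDENCE[stack[-1]] > PRECEDENCE[tok] or
                    (PRECEDENCE[stack[-1]] == PRECEDENCE[tok] and tok != "NOT"))):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)               # an index term
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unmatched parens")
        output.append(op)
    return output
```

In postfix order a request can be evaluated in a single left-to-right pass with a stack of intermediate document lists, which is presumably why the conversion precedes the scan.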




V  FINDLUlV 

V  FIND  X;TiW’9V 

[1]  +X/FIND 01 

[2]  ' FIND  FULL  EXPLANATION?  YES  OR  NO' 

[3]  -*•( BEND  T+INPE)/0 

[4]  ->( TC  1  ]=  'N'  )/FIND01 

[5]  XPLAN  2 

[6]  FINDOliW+V+CODER  X 

[7]  +(0=pV)/FINDll 

[8]  FINDOG s' EXPANSION?' 

C 9  ]  -+( BEND  T+UNP®)  [i3])/0 

[10]  T+-  (  (  a  /  «  NON  '  --  T  )  ,  (  a  /  '  SPE  '  -T)  ,  (  a  /  »  GEN  '  =T )  ,  A  /  » f?EL  ’  =  2* )  / 

0  12  3 

[11]  -►(  0  =  p2\  lO  )/FIND0  6 

[12]  -+(T  =  0  )  /FINDQ^ 

[13]  EXPAND  V 

[14]  FIND  0  4  s  W+RPOL  W 

[15]  ' START  SCAN?' 

[16]  +(BEND  T+INPl D)/0 

[17]  +  (T[1 ]= * Y' ) /FIND05 

[18]  * RE-ENTER  REQUEST 9 

[19]  -+FIND  01 

[20]  FIND05 iC+SCAN  W 

[21]  -»■(  1*C) /FIND03 

[22]  »F  ERROR* 

[23]  ¥ 

[24]  XWP  0 1 

[25]  FIND03 s ( ( ' COUNT  =  ?);C) 

[26]  0PZW 

[27]  FINDlh'MORE  REQUESTS?' 

[28]  -►(  (T[l]=«tf9  )vBEND  T+INPB)/0 

[29]  ^(A/9S4M2?9=X[i4])/F.rtfP0  6 

[30]  +FINDQ1 

V 




VPCi4tf[[]]V 
V  C+SCAN  XiTiYiZ 

[1]  WVPT+ l»iO 

[ 2 ]  i 0 

[3]  +( l<pX)/SCAN010 

[4]  ftV+(l£T)/T<-GETVEC  X 

[5]  +SCAN06 

[6]  SCAN 010zX+X90 

[7]  SCAN  01  s!T-<-Z[l] 

[8]  -^(£5T<0)  ,T=0)/SCANLTZtSCANEQZ 

[9]  Z«-“l  SH  X 

[10]  +SCAN 01 

[11]  SCANLTZt+(  (  llZT)vl*T+-\T) /SCANSYN 

[12]  +((  (T*4 )AZ=~l+pZ,  iO)v(pZ,  \0)<LZ+X\0)  /SCANSYN 

[13]  +(T<^)/SCAN05 

[14]  +(v/2 <ZP«-Z[(  l+pZ),p XI) /SCANSYN 

[15]  (GETYR  XL  1 ], YP )  +  0 . 00 0 1 x [ /I | 10 0 0 0 xjp 

[16]  +SCAN 08 

[17]  SCANO  5  s  Z+Z  -  ZP«-1  |  Z+GETVEC  Z[pZ] 

[18]  -*(  T  =  H) /SCANOT 

[19]  Y+-Y-YP+1  |  Y+GETVEC  Z[~l+pZ] 

[20]  yP[7]^JP[F-*-(7^pZ)/ip7]  +  ZP[  (7^pZ)/7-«-ZiJ] 

[21]  +(SCANOR  9SCANAND)l~l  +  Tl 

[22]  5<74tfP2Z2Z«-i?P,ST  Z9  lO 

[23]  +{2*pX ,1) /SCANSYN 

[24]  SCAN  0  6  I  C+-pftV  t  i  0 

[25]  +0 

[26]  SCANORiftV+ftV  AY+YP)  A~v/ll]Yo  0=Z) /Z+ZP 

[27]  SCAN 0  8  zX+-REST  1  SH  X 

[28]  f/F«-($F>l)/$F 

[29]  PC^i70  9  zX+-(REST  1  5P  Z)  »~l  +  p  WVPT 

[30]  X+-REST  X 

[31]  WVPT+WVPT ,1+pftV, \0 

[32]  +SCAN 01 

[33]  SCAN AND s  WV+$V  ^ (v/Joc=Z) /Y+YP 

[34]  +SCAN 08 

[35]  SCANOTi+(  0>pZP9  lO  )/SCANOT01 

[36]  ZP«-0 

[37]  SCANOTOl  zffv+ftv,  (  L  /ZP)  +  (~v/Jo  0=Z)  /Y+xfdNDOC 

[38]  +SCAN 09 

[39]  SCANSYN  zC+~l 


V 




V GETMEM [Q]V 

V  V+T  GETMEM  X\ U ;N 

[1]  V+\Q 

[2]  GETMEM01 i V+V ,UptiEMlX ; \U+T-ll 

[3]  +(  999999*X+$EMlX  °tTl  )/GETMEM01 

[4]  -K0  =  p7)/0 

C5]  +  (  A/£/«-10000*7)/0 

[6]  F-<-(  (  ~£/  )*V)  +  U\(N-10Q0Q  \N+U/V)*10000 

V 


VGETVEC CD]V 

V  V+-GETVEC  X\W 

[1]  V+\0 

[2]  ->-(0  =  pifg  i  0  )  /  0 

[3]  X+X-W+l \X 

[4]  +(Ua)A~I£MM)/0 

[5]  -»-(  (X=0)  9X<0) /O  tGETVEC01 

[6]  X+PT 1X1 

[7]  V+W+ftC  GETMEM  X 

[8]  +0 

[9]  GETVECQ1 iV+ftVlW+  1+WVPTLX- 1 ]  + i WVPTlX] -WVPTl (X+ | X ) - 1 ] 
] 

[10]  WVPT+REST  1  SH  WVPT 

[11]  ftviwi+o 

[12]  &V+(0 *WV)/ftV 

V 


VGETYRl D]V 
V  V+-GETYR  X\Y  %Z 

[1]  GETYR01+GETYR02+0 

[2]  V+iO 

[3]  -*•((  J>6)vi>J-f  4+|X[l])/0 

[4]  +(  l*pZ+-[0o  0  001  +  1  000  Ox  (  (1*U[  2  3]  )/Xl2  3]  )  ,  i0)/0 

[ 5 ]  +(SCANLT 9SCANLE 9SCANNE tSCANEQ tSCANGE ,SCANGT ) [J  ] 

[6]  SCANGTi  V+V9  (  v/($MEMl  ;  2  3  4]*999999  )A?M£M[  ; 

2  3  M>Z)/$MEMZ ;1] 

[7]  +GETIR 01 

[8]  SCAtf LT  s  7-*- 7 ,  (  v  /  (  (  $MEMl  ;  2  3  4]*0  )  A$MEMl  ;  2  3 
4 ] *1  )  f^YMEMZ  ;  2  3  4] <Z  ) /^£’Af[  ;  1  ] 

[9]  +GETYR 02 

[10]  SCANLE IGETYR02+GETYR12 

[11]  -+SCANLT 

[12]  G£TJZ?12  s+SCMtftf# 

[13]  SC4tfG£jG£Tyi?01^G£TJi?ll 

[14] 

[15]  G£TJi?ll 

[16]  SCANNED V+Vt (~EQF  Z) /iMEMli 1] 

[17]  -+GETYR2Q 

[18]  SCANEQt V+V, (EQF  Z)/$MEMl ;1] 

[19]  GETYR20  ?-»-(  0  =  p7-<-7#  i  0  )/0 

[20]  V+RMVCM  V 


V 




vopzwmv 

V  OPDNiTil 

[1]  -Ki=ptfy,i)/o 

[2]  'OUTPUT  DOC  NOS?' 

[3]  +  (  »tf»=(JTO!])[  l])/0 

[4]  T+ltiv 

[5]  -*(  A/J«-#V<10000  )/OPDN3 

[6]  tfV+(  Jx^F)  +  (~J)x!T-l  |  10  0  00 

[7]  OPDN3iI+0 

[8]  OPDN 1  !-►(  (  p ^7 »  i  0  ) <T-«-J+  5  )  / OPDN2 

[9]  (TR  2  5  p2,,#7[iZ,«-J+  5+i5]) 

[10]  'MORE?' 

[11]  H'N'=(INPB)lll)/0 

[12]  +OPDN 1 

[13]  OPDN2t (TR(2,T)p( I+\T) ,  #7[It  iT«-(  p#J\ iO)-I«-I- 
5]) 

V 




VPRINTl D]V 

V  PRINT  X\T\U 1V1AN ;P;J 

[1]  +X/PRINTL11 

[2]  ' FULL  EXPLANATION?  YES  OR  NO' 

[3]  -+■(  BEND  T+INPl/])/  0 

[4]  -+(Tll]  =  ' N' ) /PRINTL11 

[5]  XPLAN  3 

[6]  PRINT L 11s' 

PRINT  WHAT?' 

C  7  ]  PRINT12  z-+(BEND  U+-INPE)  /  0 

[8]  -*■({/[  1  ]  e  ’  0123456789’  ) /PRINT!  13 

[9]  'POP  NO  FIRST' 

[10]  +PRINT 12 

[11]  PRINTL13 i+(~v/V+' , ? =U) /PRINTL 14 

[12]  AN+-NUM(  (  p  i/)a  l  +  7il)/£/ 

[13]  P«-5  p  0 

[14]  P«-+/7 

[15  ]  PRINTL17  ;+(  0  >1+1-1 ) /PPIPTLl 5 

[16]  7-*- »  ,  *  =  P 

[17]  £/*-(  (~(pP)a7il)/P)  ,3p  •  » 

[18]  -*•(  A/Ul  i2]=ML'  )/PRINTL16 

[19]  P^Pv(  a/7=  ’TP’  )  ,  (  a/7=  MZM  )  ,  (  a/7=  ?  *70’  )  ,  (  A/7=fZPf  )  ,  (  a  /  * 

45  1  =  7+0[  i  2  ]  ) 

[20]  +PRINTL17 

[21]  PRINTL16  ?P«-5pl 

[22]  PRINTL15  :  -»■(  v/F  )  /PRINTL18 

[23]  PRINTLl^i' INCORRECT  PARAMETERS  --  ?  ,  0 
[2  4]  -+PRINTL 12 

[25]  PFTFFF 18  sdtf  OUTPUT  F 

[26]  -►PFItfPFll 


V 




VOUTPUTlUlV 

V  AN  OUTPUT  XiMiTP 

[1]  TP-*-^200a  (M=  1  |  *  )  il ) /M+(PERMEM,  199p  ’  |  »  )1$MPTIAN  ; 
2  ]  +  l  +  i 200] 

[2]  Tp+(~TP='~' )/TP 

[3]  +(Xlll=0)/OUTPUT10 

[4]  ( ( ( p TP) a  l  +  C^Pr’a' )\1)/TP) 

[5]  OUTPUT1  0  :  TP+(  ~(  p  TP)  a  (TP-1  a. '  )\1)/TP 

C  6 ]  +(X12]  =  0) /OUTPUT11 

[7]  ( ( ( pPP) a  1+(TP=» c * )X1)/TP) 

[8]  OUTPUTll:TP+(~(pTP)a(TP='  °  ’  )il)/TP 

[9]  +(Xl3]=0) /OUTPUT12 

[10]  C  (  (  p TP )  a  ~  1  +  (  TP  =  ’=>’)  x  1 )  / TP ) 

[11]  0UTPUT1 2 : TP+( ~( p  TP) a  C  TP- ’ 3  * )il)/PP 

[12]  -^UL^l-0) /OUTPUT13 

[13]  +(~v/M4-AN=yMEMl  ;  1  ]  ) /OUTPUTS  4 

[  1  4  J  (  C  ’  YZABS  , -  ');$MEM£M\  1;  2  3  4]) 

[15]  0UTPUTlh:Mt-'?'  ,  C  (  Cp  TP) a~l  +  (  TP-  ,l*)il)/TP)»  '=>' 

[16]  M[CM=’=>*  )/\pM]+’  ’ 

[17]  W 

[  1  8  J  0UTPUT13:TP±(;~(pTP)a(Tr^'  i*  )\1)/TP 
[191  -vC^[5j=0)/0 

[20]  (  ( ( pTP)a~l+(TP= '  |  ’  )il)/TP) 

V 




      ∇STORE[⎕]∇
    ∇ STORE P;T
[1]   →P/STORE01
[2]   'FULL EXPLANATION?'
[3]   →(BEND T←INP ⍞)/0
[4]   →(T[1]='N')/STORE01
[5]   XPLAN 4
[6]   STORE01:(('DOC NO ');NDOC←NDOC+1)
[7]   STRTIT
[8]   STRAUT
[9]   STRJOU
[10]  STRXTR
[11]  STRYR
[12]  STRABS
[13]  CLENUP
[14]  'MORE?'
[15]  →('Y'=(INP ⍞)[1])/STORE01
    ∇


VSTRTITl D]V 

V  C+STRTIT ; T 

[ 1 ]  C+ \0 

[ 2 ]  » TITLE ' 

[ 3 ]  T+Q , i 0 

[4]  -*(  0-pT) /STRTIT01 

[ 5 ]  Permem+Permem , '  ~ » , T 

[  6  ]  •+■ 0 

[ 7 ]  STRTIT 0 1 t PERMEM+PERMEM , ' -NONE ' 

V 


VSTRMEMlUlV 
V  C  STRMEM  T;CT 

[1]  C+PTIC] 

[2]  STRMEM01 i+( 999999=CT+ftEMlC iNC] )/STRMEM02 

[3]  +(C=C+CT)/STRMEM01 

[4]  STRMEM02 i+(ftCzCT+ttEMlC ; \fiC-ll i 0 ) / STRMEM00 

[5]  +(T=ttEMlCiCTl+T) / 0 

[6]  STRMEM03iMEM+(  (  (  pttEM)  [  1  ]  +  l  )  ,#<7)p(  t$EM )  ,T,  (  (#(7- 
2)p0  )  ,  999999 

[7]  $EML  C  ;$C]-«-(  p$EM )  [  1  ] 


V 




vstrautiuh 

V  Z+STRAUT \T \C \U 

[ 1 ]  Z+ i 0 

[2]  'AUTHOR' 

[3]  U+T+t(T*  '  '  )/MD),iO 

[4]  +( 0=pT)/STRAUT03 

[5]  +(~' ,'eT) /STRAUT 0  5 

[6]  T+Tl\  l+Tx','1 

[7]  STRAUTO  5  s  -►(  1  -C^AUTH  T)/STRAUT  01 

[8]  STRAUT 02 %C  STRMEM  tiNDOC 

[9]  +STRAUT 04 

[10]  STRAUT03%U+T+'NONE' 

[11]  +STRAUT0H 

[12]  STRAUT 01 i%PT+%PT,l 

[13]  tiEM+t  (  (ptar)[l]  +  l )  JC)p(  ,&EM)  t&NDOC  AiflC- 2  )p0  )  , 
999999 

[14]  ZST+LST ,T 

[15]  ISTPT+LSTPT ,  [  plSTPT  ]  +  p T 

[16]  Pt+Pt,  (pttEM)lll 

[17]  STP/l  i/T  0  4  ;  PERMEM+PERMEM ,  ’  a  '  ,  U 

V 


VSTRJOUlUlV 
V  Z+STRJOUiCiV 

[1]  Z«-i0 

[2]  ' JOURNAL ' 

[3]  ^,rZ7i?e/C>£/0  4  s  *  ?)/7«-Q),i0 

[4]  ■>(  0-pT) /STRJOUOl 

[5]  U+ 0 

[6]  ^  eT)/STRJOU 05 

[7]  U+NUM(Te  '  0123456789*  )/!T 

[8]  -*-(  99  9  9^£/)/5!TC?i?Pll 

[9]  » VOLUME  NO  <,  99  99’ 

[10]  +STRJOU 04 

[11]  ST0PP11  \~1+Tx  '  ,  '  ] 

[12]  STRJOU 0  5s-K  1  =  04 i/TP  T)/STRJOU 0  2 

[13]  C  STRMEM  U+ldNDOC*  1000  0 

[14]  ->0 

[15]  STRJOUOl :  V-*-'  NONE  ' 

[16]  +STRJOU 03 

[17]  STRJOU02i' INVALID  JOURNAL  -  '  WV 

[18]  +STRJOU 04 


V 




VSTRXTR CD]V 

V  X+STRXTR1T1X 

[I]  Z«-x0 

[  2  ]  » XTERMS v 

[3]  STRXTR02iT+((T*'  *  )/T«-H),iO 

[4]  -►(  0-pT)  /STRXTR02 

[5]  -+(BEND  ( !T ,  '  «)[i3])/0 

[6]  +CleC+AUTH  T)/STRXTR 01 

[7]  C  STRMEM  $NDOC 

[ 8 ]  Permem+Permem , ’ ^ ' , T 

[9]  -+STRXTR  0  2 

[10]  STRXTR01  :  *  IWLAJZ?  -  *  ,T 

[II]  +STRXTR 02 

V 


VS27?7i?[[]]V 
V  C+STRYR ;T }I 

[1]  <7«-i0 

[ 2 ]  » YEARS * 

[3]  r-^(  (  T*  *  '  )  /  T^C] )  ,  lO 

[4]  -K~(2V  *  )[l]e  *  0123456789*  )/0 

[5]  ->(  0  =  pT)/0 

[6]  HMEM+  (  (  (pteA/)[l]  +  l )  ,4)p(  9$MEM)  9!bND0C  ,  3p  999999 

[7]  +  (  '  -  «  eT)/5Ti?Ji?01 

[8]  I«-l 

[9]  STRYR0  3  i  -►(  4<J-*-J+l  )  /  O 

[10]  +  (  0  =  pT)/0 

[11]  teAf[(pteA/)[l];J]^i/Af  2*[i4] 

[12]  2,^(~(p3’)a5  )/T 

[13]  -+STRYR  0  3 

[14]  STRYROUfMEML(p^MEM)Lll;  2  3  4  ] -*- ( Z7  Af  T[  i  4]  )  ,  0  tNUM  Tl 
5+  i  4] 


V 




VSTiMBSCDDV 

V  C+STRABS iT 

[  1  ]  C-*-\  0 

[2]  'ABSTRACT ’ 

[3]  T+((T*'  *  )/2H!l)  . lO 

[4]  +  (  0>pT) /STRABSOl 

[5]  T+'NONE* 

C  6 ]  STRABS 01 i  pERMEM+pERMEM , ' 1 » , T , '  I  ' 

V 


VCLENUPIU1V 
V  C+CLENUP 

[1]  C*~  i  0 

[ 2 ]  Pmpt+Pmpt , 1 + p Permem 


V 




VAUTH [Q]V 

V  V+AUTH  WiBWiAViN 

Cl]  RW+p{W<r{W**  *)/W)9  iO 

C  2 ]  A V+lSTPT [ 1 +N1 - ESTPT C  N+ i ~ 1  +  p ZSTPT ] 

[3]  47<-(i?J/=,47)/ip47 

[4]  -*■(  0-pAV)/AUTHAUTH01 

C  5  ]  I-*-0 

[6]  4tf2,iL4y2,/70  2  s-K  (pAV9  iO)<J«-I+l  ) /AUTHAUTHQ1 

[7]  ^(~a/V=£>52,[“1  +  £527PT[  (AV9  i  0  )  [  J]  ]+  \RWl  ) /AUTHAUTH02 

[8]  V+AVII1 

[  9  ]  >0 

[10]  AUTHAUTH 01 i V+~l 

V 


      ∇BEND[⎕]∇
    ∇ X←BEND T
[1]   X←∧/'END'=(T,'   ')[⍳3]
    ∇


vp$fCD]v 

V  T+EQF  Z 

[1]  T+v-/Z=$MEMli  2  3  4] 

[2]  T+TV($MEML  i3]  =  0)  a(¥MEMI  ;4]>Z)a(?M£A/[  j  4]  *9  9  9  9  9  9  )  A^MPAf 
[  ;2]<Z 

V 


VIWP[[]]V 

V  T+INP  X 

[1]  T+UX**  *)/X)%' 

V 


      ∇NUM[⎕]∇
    ∇ V←NUM M
[1]   V←10⊥¯1+'0123456789'⍳M
    ∇
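
The two small routines BEND and NUM can be paraphrased in Python as below. The names `bend` and `num` are chosen here for illustration. One difference is an assumption of this sketch and is marked in the code: the Python version rejects a non-digit character, whereas the APL expression performs no such check.

```python
DIGITS = "0123456789"

def num(text):
    """Decode a string of decimal digits, as the base-10 decode in NUM does."""
    value = 0
    for ch in text:
        # str.index raises ValueError on a non-digit; the APL does not check.
        value = 10 * value + DIGITS.index(ch)
    return value

def bend(text):
    """True when the first three characters of the input are 'END'."""
    return (text + "   ")[:3] == "END"
```

Padding the input with three blanks before taking the first three characters keeps the comparison well defined for inputs shorter than three characters.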




      ∇REST[⎕]∇
    ∇ X←REST Y
[1]   X←(~(⍴Y)α1)/Y
    ∇

      ∇RMVCM[⎕]∇
    ∇ N←RMVCM V
[1]   N←(N∊V,⍳0)/N←⍳⌈/V,⍳0
    ∇


      ∇SH[⎕]∇
    ∇ SHZ←N SH V
[1]   SHZ←(V,⍳0)[1+(⍴(V,⍳0))|(⍳⍴(V,⍳0))-1+N]
    ∇
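
Read with index origin 1, the indexing expression in SH selects element 1+(⍴V)|(I-1)-N of V for each position I, which is a cyclic shift of the vector. A Python sketch of the same rearrangement follows; the name `sh` and the handling of the empty vector are choices made here, not taken from the listing.

```python
def sh(n, v):
    """Rotate list `v` cyclically: element i of the result is v[(i - n) mod len(v)].

    A positive n moves elements toward the end (the last element wraps to the
    front); n = -1 moves the first element to the end.
    """
    if not v:
        return []                 # assumed behaviour for an empty vector
    k = n % len(v)
    return v[-k:] + v[:-k] if k else list(v)
```

For example, `sh(-1, stack)` cycles the first element of a list to the back, which matches the way `¯1 SH` is applied to the working vectors in RPOL and SCAN.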


      ∇TR[⎕]∇
    ∇ Z←TR M
[1]   Z←((⍴M)[2 1])⍴(,M)[,(⍳(⍴M)[2])∘.+(⍴M)[2]×¯1+⍳(⍴M)[1]]
    ∇


VSORT C03V 
V  R+KEY  SORT  ITEM 

C 1  ]  R«- ITEML  (  +  /  ( KEY  °  „  <KEY  )  +  ( KEY  o  .  =KEY )  Ai?  o  0  $R  )  x  r+  x  pKEY  ] 


V 




      ∇XPLAN[⎕]∇
    ∇ V←XPLAN P
[1]   V←⍳0
[2]   →(XPLAN1,XPLAN2,XPLAN3,XPLAN4)[P]
[3]   XPLAN1:'
YOU ARE UNDER THE CONTROL OF THE MAINLINE ROUTINE'
[4]   'YOU HAVE 3 ROUTINES AVAILABLE FOR PROCESSING'
[5]   'IN ORDER TO ENTER THE REQUIRED ROUTINE, TYPE ITS NAME AFTER'
[6]   '   ''GO'' IS TYPED'
[7]   'IN ORDER TO RETURN CONTROL TO THIS ROUTINE, TYPE ''END'' AT'
[8]   '   ANY INPUT TIME
'
[9]   'FIND'
[10]  '   ACCEPTS INPUT REQUESTS AND RETURNS DOCUMENT NUMBERS'
[11]  '   SATISFYING THE REQUEST'
[12]  'PRINT'
[13]  '   LISTS ANY PART OF THE DOCUMENT WITH THE GIVEN DOCUMENT'
[14]  '   NUMBER'
[15]  'STORE'
[16]  '   ALLOWS STORAGE OF DOCUMENTS'
[17]  'END '
[18]  '   RETURNS CONTROL TO THE APL SYSTEM'
[19]  →0
[20]  XPLAN2:'
IF WEIGHTS ARE TO BE ATTACHED TO EACH INDEX TERM, TYPE ''YES'''
[21]  '   WHEN REQUESTED (EQUAL WEIGHTS?);'
[22]  '   ELSE,''NO'' '
[23]  'THEN, TYPE THE REQUEST IN STANDARD BOOLEAN FORM'
[24]  'WHEN ''EXPANSION?'' IS REQUESTED, ONE OF FOUR WORDS (OR THEIR'
[25]  '   ABBREVIATIONS AS IN THE FOLLOWING PARENTHESES)'
[26]  '   IS EXPECTED'
[27]  '
NONE (NON)'
[28]  '   NO EXPANSION OF THE REQUEST'
[29]  'GENERAL (GEN)'
[30]  '   INCLUDE GENERAL TERMS IN AN OR RELATIONSHIP'
[31]  'SPECIFIC (SPE)'
[32]  '   INCLUDE SPECIFIC TERMS IN AN OR RELATIONSHIP'
[33]  'RELATED (REL)'
[34]  '   INCLUDE RELATED TERMS IN AN OR RELATIONSHIP
'
[35]  'WHEN REQUESTED ''START SCAN?'' TYPE ''YES'' OR ''NO'', IF'
[36]  '   ''YES'', A SEARCH OF ALL DOCUMENTS IS STARTED; IF ''NO'','
[37]  '   A REQUEST MAY BE RE-ENTERED'
[38]  'WHEN THE SEARCH HAS BEEN COMPLETED, THE DOCUMENT NUMBERS'
[39]  '   MAY BE LISTED BY TYPING ''YES'' TO ''OUTPUT DOC NOS?'''
[40]  '   ELSE, ''NO'''
[41]  'THEN ''MORE REQUESTS?'' IS TYPED; IF THE SAME REQUEST IS TO BE'
[42]  '   EXPANDED DIFFERENTLY,'
[43]  'TYPE ''SAME''; ELSE, ''NO'' OR ''END'' '
[44]  →0
[45]  XPLAN3:'
THE FORMAT OF THE INPUT TO THE ''PRINT'' SUBSYSTEM IS AS'
[46]  '   FOLLOWS'
[47]  'DOC NO, OPTION 1 (, OPTION 2) (, OPTION 3) ...'
[48]  'DOC NO IS THE DOCUMENT NUMBER CONCERNED'
[49]  'SIX OPTIONS ARE AVAILABLE
'
[50]  'TITLE (TI)'
[51]  '   THE TITLE IS LISTED'
[52]  'AUTHOR (AU)'
[53]  '   THE AUTHOR IS LISTED'
[54]  'JOURNAL (JO)'
[55]  '   THE SOURCE JOURNAL IS LISTED'
[56]  'XTERMS (XT)'
[57]  '   THE INDEX TERMS AND THE PERIOD THE DOCUMENT DEALS WITH'
[58]  '   ARE LISTED'
[59]  'ABSTRACT (AB)'
[60]  '   THE ABSTRACT IS LISTED'
[61]  'ALL (AL)'
[62]  '   ALL THE ABOVE OPTIONS ARE LISTED
'
[63]  'TO EXIT FROM THE ROUTINE, TYPE ''END'' '
[64]  →0
[65]  XPLAN4:'
INITIALLY, THE CURRENT DOCUMENT NUMBER IS TYPED. THEN,'
[66]  '''TITLE'' IS TYPED, AND THE USER INPUTS THE TITLE'
[67]  '''AUTHOR'' IS TYPED, AND THE USER INPUTS THE AUTHOR'
[68]  '''JOURNAL'' IS TYPED, AND THE USER INPUTS THE JOURNAL IN'
[69]  '   THE FORMAT'
[70]  '
   JOURNALNAME, VOL. XX
'



[71]  '   THE VOLUME NUMBER MAY BE OMITTED'
[72]  '''XTERMS'' IS TYPED, AND THE USER INPUTS THE INDEX TERMS,'
[73]  '   ONE AT A TIME. TO TERMINIATE, TYPE ''END'''
[74]  '''YEARS'' IS TYPED, AND THE USER INPUTS THE YEARS IN ONE OF'
[75]  '   THE FOLLOWING FORMATS:
'
[76]  '   1921'
[77]  '   1921,1923'
[78]  '   1921,1923,1928'
[79]  '   1921-1928
'
[80]  '''ABSTRACT'' IS TYPED, AND THE USER INPUTS THE ABSTRACT
'
[81]  '''MORE?'' IS TYPED'
[82]  '   IF THE RESPONSE IS ''NO'', NOTHING IS STORED UNDER'
[83]  '   THAT DOCUMENT NUMBER, AND CONTROL IS RETURNED'
[84]  '   TO SARACNL'
[85]  '   IF ''YES'', THE NEXT DOCUMENT NUMBER IS TYPED'




APPENDIX  C 


EXAMPLES  OF  USE  OF  SARA 


Example  1  (STORE  routine) 

SARACNL 

FULL  EXPLANATION?  YES  OR  NO
Y

YOU  ARE  UNDER  THE  CONTROL  OF  THE  MAINLINE  ROUTINE
YOU  HAVE  3  ROUTINES  AVAILABLE  FOR  PROCESSING
IN  ORDER  TO  ENTER  THE  REQUIRED  ROUTINE,  TYPE  ITS  NAME  AFTER
'GO'  IS  TYPED

IN  ORDER  TO  RETURN  CONTROL  TO  THIS  ROUTINE,  TYPE  'END'  AT
ANY  INPUT  TIME


FIND 

ACCEPTS  INPUT  REQUESTS  AND  RETURNS  DOCUMENT  NUMBERS 
SATISFYING  THE  REQUEST 

PRINT 

LISTS  ANY  PART  OF  THE  DOCUMENT  WITH  THE  GIVEN  DOCUMENT 
NUMBER 

STORE 

ALLOWS  STORAGE  OF  DOCUMENTS 

END 

RETURNS  CONTROL  TO  THE  APL  SYSTEM 


GO 

STORE 

FULL  EXPLANATION? 

YES 

INITIALLY,  THE  CURRENT  DOCUMENT  NUMBER  IS  LISTED  AND  'MORE?'
IS  TYPED

IF  THE  RESPONSE  IS  'NO',  NOTHING  IS  STORED  UNDER  THAT
DOCUMENT  NUMBER,  AND  CONTROL  IS  RETURNED  TO  SARACNL
IF  'YES'  IS  TYPED,  WE  CONTINUE


'TITLE'  IS  TYPED,  AND  THE  USER  INPUTS  THE  TITLE
'AUTHOR'  IS  TYPED,  AND  THE  USER  INPUTS  THE  AUTHOR
'JOURNAL'  IS  TYPED,  AND  THE  USER  INPUTS  THE  JOURNAL  IN
THE  FORMAT

JOURNALNAME,  VOL.  XX

THE  VOLUME  NUMBER  MAY  BE  OMITTED
'XTERMS'  IS  TYPED,  AND  THE  USER  INPUTS  THE  INDEX  TERMS,
ONE  AT  A  TIME.  TO  TERMINIATE,  TYPE  'END'




'YEARS'  IS  TYPED,  AND  THE  USER  INPUTS  THE  YEARS  IN  ONE  OF
THE  FOLLOWING  FORMATS:

1921

1921,1923

1921,1923,1928

1921-1928

'ABSTRACT'  IS  TYPED,  AND  THE  USER  INPUTS  THE  ABSTRACT

THE  NEXT  DOCUMENT  NUMBER  IS  THEN  TYPED 

DOC  NO  5 
TITLE 

A  UNIVERSITY  IN  TROUBLE 

AUTHOR 
THOMPSON-WP

JOURNAL 

SASKHIST 

XTERMS 

EDU 

UNIVERSITY 

INVALID  XTERM  -  UNIVERSITY 
UNIVERSITIES 

UNIVERSITY-OF-SASKATCHEWAN

END 

YEARS 

1919 

ABSTRACT 


MORE? 

NO 

GO 

END 

ALL  DONE 
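
The four YEARS formats accepted in Example 1 can be recognized as in the following Python sketch. It illustrates the input formats only: the function name `parse_years` and the tuple it returns are assumptions, and SARA's own internal representation of the period is not reproduced here.

```python
def parse_years(text):
    """Parse '1921', '1921,1923', '1921,1923,1928' or '1921-1928'.

    Returns ("range", first, last) for the hyphenated form and
    ("years", y1[, y2[, y3]]) for one to three separate years.
    """
    text = text.replace(" ", "")
    if "-" in text:
        first, last = text.split("-")
        return ("range", int(first), int(last))
    years = [int(part) for part in text.split(",")]
    if not 1 <= len(years) <= 3:
        raise ValueError("one to three years expected")
    return ("years", *years)
```

Distinguishing a range from a list at input time lets a later period test, such as a request on PER, treat the two cases differently.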




Example  2  (FIND  routine) 


SARACNL 

FULL  EXPLANATION?  YES  OR  NO 

NO 

GO 

FIND 

FIND  FULL  EXPLANATION?  YES  OR  NO 
Y

IF  WEIGHTS  ARE  TO  BE  ATTACHED  TO  EACH  INDEX  TERM,  TYPE  'YES'
WHEN  REQUESTED  (EQUAL  WEIGHTS?);
ELSE,  'NO'

THEN,  TYPE  THE  REQUEST  IN  STANDARD  BOOLEAN  FORM
WHEN  'EXPANSION?'  IS  REQUESTED,  ONE  OF  FOUR  WORDS  (OR  THEIR
ABBREVIATIONS  AS  IN  THE  FOLLOWING  PARENTHESES)
IS  EXPECTED

NONE  (NON) 

NO  EXPANSION  OF  THE  REQUEST 
GENERAL  (GEN) 

INCLUDE  GENERAL  TERMS  IN  AN  OR  RELATIONSHIP 
SPECIFIC  (SPE) 

INCLUDE  SPECIFIC  TERMS  IN  AN  OR  RELATIONSHIP 
RELATED  (REL) 

INCLUDE  RELATED  TERMS  IN  AN  OR  RELATIONSHIP 

WHEN  REQUESTED  'START  SCAN?'  TYPE  'YES'  OR  'NO',  IF
'YES',  A  SEARCH  OF  ALL  DOCUMENTS  IS  STARTED;  IF  'NO',
A  REQUEST  MAY  BE  RE-ENTERED

WHEN  THE  SEARCH  HAS  BEEN  COMPLETED,  THE  DOCUMENT  NUMBERS
MAY  BE  LISTED  BY  TYPING  'YES'  TO  'OUTPUT  DOC  NOS?';
ELSE,  'NO'

THEN  'MORE  REQUESTS?'  IS  TYPED;  IF  THE  SAME  REQUEST  IS  TO  BE
EXPANDED  DIFFERENTLY,  TYPE  'SAME';
ELSE,  'NO',  'YES',  OR  'END'

EQUAL  WEIGHTS? 

Y 

(PARTIES∨ELECTIONS)∧SASKATCHEWAN∧(PER≥1905)∧(PER≤1930)
EXPANSION? 

NONE 

START  SCAN? 

Y 

COUNT  =  1 
OUTPUT  DOC  NOS? 



Y


1  13 

MORE  REQUESTS? 

SAME 

EXPANSION? 

GENERAL 
START  SCAN? 

Y 

COUNT  =  6 
OUTPUT  DOC  NOS? 

Y 

1  13 

2  14 

3  11 

4  37 

5  19 
MORE? 

Y

6  17 

MORE  REQUESTS? 

Y 

EQUAL  WEIGHTS? 

NO 

FAM∧SASKATCHEWAN∧(PER≥1860)
WEIGHT?  -  FAM
12 

WEIGHT?  -  SASKATCHEWAN 
25 

WEIGHT?  -  PER 

2 

EXPANSION? 

NONE 

START  SCAN? 

Y 

COUNT  =  1
OUTPUT  DOC  NOS? 

Y 


1  4 

MORE  REQUESTS? 

SAME

EXPANSION? 
RELATED 
START  SCAN? 

Y 

COUNT  =  3 
OUTPUT  DOC  NOS? 



Y 


1  11 
2  6 
3  4 

MORE  REQUESTS? 

NO 

GO 
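
The EXPANSION? options described in the FIND explanation above can be sketched in Python as a thesaurus lookup that replaces each request term by an OR of itself and its general, specific, or related terms. The thesaurus entries, the function names `expand` and `term_matches`, and the dictionary layout are hypothetical; the actual thesaurus of the system is not shown in these transcripts.

```python
# Hypothetical thesaurus entries, for illustration only.
THESAURUS = {
    "UNIVERSITIES": {
        "GEN": ["EDU"],
        "SPE": ["UNIVERSITY-OF-SASKATCHEWAN"],
        "REL": [],
    },
}


def expand(term, mode, thesaurus=THESAURUS):
    """Return the terms that stand in for `term` under the given expansion.

    `mode` is NONE, GENERAL, SPECIFIC or RELATED, or its three-letter
    abbreviation. The result is meant to be read as an OR of its members.
    """
    key = mode[:3]
    if key not in ("NON", "GEN", "SPE", "REL"):
        raise ValueError("unknown expansion: " + mode)
    if key == "NON":
        return [term]
    return [term] + thesaurus.get(term, {}).get(key, [])


def term_matches(term, mode, doc_terms, thesaurus=THESAURUS):
    """True if the document carries the term or any term it expands to."""
    return any(t in doc_terms for t in expand(term, mode, thesaurus))
```

Because expansion only adds alternatives in an OR relationship, an expanded request can never match fewer documents than the unexpanded one. That is consistent with the counts in the transcript, which rise or stay the same when a request is rerun with SAME and a different expansion.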


FIND  NO 

EQUAL  WEIGHTS? 

Y

PRESBYTERIAN∧(PER≥1850)∧(PER≤1920)∧(MISSIONS∨MISSIONARIES)∧
SASKHIST

EXPANSION? 

NONE 

START  SCAN? 

Y

COUNT  =  1
OUTPUT  DOC  NOS? 

N 

MORE  REQUESTS? 

SAME 

EXPANSION? 

GENERAL 
START  SCAN? 

Y

COUNT  =  7 
OUTPUT  DOC  NOS? 

Y 


1  10 

2  3 

3  1 

4  4 

5  6 
MORE? 

Y 

6  40 

7  31 

MORE  REQUESTS? 

NO 

GO 



FIND  NO 

EQUAL  WEIGHTS? 

NO 

TRAVEL∧EXPLORERS∧(SASKATCHEWAN∨ALBERTA)∧SASKHIST
WEIGHT?  -  TRAVEL 
15 

WEIGHT?  -  EXPLORERS 
20 

WEIGHT?  -  SASKATCHEWAN 
10 

WEIGHT?  -  ALBERTA 
25 

WEIGHT?  -  SASKHIST 
5 

EXPANSION? 

NONE 

START  SCAN? 

Y

COUNT  =  0 
MORE  REQUESTS? 

SAME 

EXPANSION? 

GENERAL 
START  SCAN? 

YES 

COUNT  =  3 
OUTPUT  DOC  NOS? 

Y

1  42 

2  41 

3  38 

MORE  REQUESTS? 

NO 

GO 



FIND  NO 

EQUAL  WEIGHTS? 

Y

MACKAY-JA∧REL∧(PER≤1907)∧(~GEO)
EXPANSION?

NONE 

START  SCAN? 

Y

COUNT  =  2 
OUTPUT  DOC  NOS? 

Y


1  3

2  1 

MORE  REQUESTS? 
SAME 

EXPANSION? 
SPECIFIC 
START  SCAN? 

Y 

COUNT  =  2 
OUTPUT  DOC  NOS? 
Y

1  3 

2  1 

MORE  REQUESTS? 

NO 

GO 



FIND  NO 

EQUAL  WEIGHTS? 

NO 

LOCAL-GOVERNMENT∧(NORTHWEST-TERRITORIES∨PRINCE-ALBERT
∨BALCARRES)∧(PER≥1905)

WEIGHT?  -  LOCAL-GOVERNMENT 

4 

WEIGHT?  -  NORTHWEST-TERRITORIES

5 

WEIGHT?  -  PRINCE-ALBERT 
7 

WEIGHT?  -  BALCARRES
2 

WEIGHT?  -  PER 
1 

EXPANSION? 

NONE 

START  SCAN? 

YES 

COUNT  =  2
OUTPUT  DOC  NOS?
Y


1  21 
2  20 

MORE  REQUESTS? 
SAME 

EXPANSION? 
RELATED 
START  SCAN? 

Y

COUNT  =  6 
OUTPUT  DOC  NOS? 

Y 

1  10 
2  8 

3  18 

4  21 

5  20 
MORE? 

Y 

6  24 

MORE  REQUESTS? 

NO 

GO 
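
The transcripts above show weights being attached to request terms, but they do not state how the weights are combined. One simple scheme, shown here purely as an assumption and not as the method of the system, scores each document by the sum of the weights of the request terms it carries and lists documents in descending score. The document numbers and index-term sets below are invented for illustration, and the Boolean structure of the request is ignored.

```python
def score(doc_terms, weights):
    """Sum the weights of the request terms present in one document."""
    return sum(w for term, w in weights.items() if term in doc_terms)


def rank(documents, weights):
    """Return document numbers with a positive score, highest score first.

    Ties are broken by ascending document number.
    """
    scored = [(score(terms, weights), doc_no) for doc_no, terms in documents.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_no for _, doc_no in scored]


# Weights as typed in the last weighted request above.
weights = {
    "LOCAL-GOVERNMENT": 4,
    "NORTHWEST-TERRITORIES": 5,
    "PRINCE-ALBERT": 7,
    "BALCARRES": 2,
}

# Invented documents; these are not records from the thesis data base.
documents = {
    101: {"LOCAL-GOVERNMENT", "NORTHWEST-TERRITORIES"},
    102: {"LOCAL-GOVERNMENT", "PRINCE-ALBERT"},
    103: {"POL"},
}

print(rank(documents, weights))
```

Under this assumed scheme document 102 scores 11 and document 101 scores 9, so 102 is listed first and 103 is dropped.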



Example  3  (PRINT  routine) 


SARACNL 

FULL  EXPLANATION?  YES  OR  NO 

N 

GO 

GO 

INCORRECT  COMMAND 
GO 

PRINT 

FULL  EXPLANATION?  YES  OR  NO 
Y 

THE  FORMAT  OF  THE  INPUT  TO  THE  'PRINT'  SUBSYSTEM  IS  AS  FOLLOWS:

DOC  NO,  OPTION  1  (,  OPTION  2)  (,  OPTION  3)  ... 

DOC  NO  IS  THE  DOCUMENT  NUMBER  CONCERNED 
SIX  OPTIONS  ARE  AVAILABLE 

TITLE  (TI)

THE  TITLE  IS  LISTED
AUTHOR  (AU)

THE  AUTHOR  IS  LISTED
JOURNAL  (JO)

THE  SOURCE  JOURNAL  IS  LISTED 
XTERMS  (XT) 

THE  INDEX  TERMS  AND  THE  PERIOD  THE  DOCUMENT  DEALS  WITH  ARE 
LISTED 

ABSTRACT  (AB) 

THE  ABSTRACT  IS  LISTED 
ALL  (AL) 

ALL  THE  ABOVE  OPTIONS  ARE  LISTED 
TO  EXIT  FROM  THE  ROUTINE,  TYPE  'END' 


PRINT  WHAT? 

16,TITLE,AUTHOR

DOMINION  GOVERNMENT  AID  TO  THE  DAIRY  INDUSTRY  IN  WESTERN  CANADA, 
1890-1906 
CHURCH-GC 

PRINT  WHAT? 

11,XT,AB

YEARS  -  1860  0  1919

BIO  POL  BROWN-GW  SASKATCHEWAN  NORTHWEST-TERRITORIES  POLITICIANS 
NONE 


PRINT  WHAT? 



23,ALL

QUIET  EARTH,  BIG  SKY

STEGNER-W 

SASKHIST 

YEARS  -  1915  0  1919 

SOC  EASTEND  SASKATCHEWAN  PIONEER-LIFE
NONE 

PRINT  WHAT? 

END 

GO 

END 

ALL  DONE 
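
The PRINT input format described above (a document number followed by one or more options, each given in full or by its two-letter abbreviation) can be sketched in Python as follows. The function name `parse_print_request` and the error messages are hypothetical, and the journal abbreviation is taken to be JO.

```python
# Two-letter abbreviations mapped to the full option names.
OPTIONS = {
    "TI": "TITLE",
    "AU": "AUTHOR",
    "JO": "JOURNAL",
    "XT": "XTERMS",
    "AB": "ABSTRACT",
    "AL": "ALL",
}

# The parts of a document that ALL stands for, in listing order.
SECTIONS = ["TITLE", "AUTHOR", "JOURNAL", "XTERMS", "ABSTRACT"]


def parse_print_request(line):
    """Parse 'DOC NO, OPTION 1 (, OPTION 2) ...' into (doc_no, [options])."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 2 or not fields[0].isdigit():
        raise ValueError("expected: DOC NO, OPTION 1 (, OPTION 2) ...")
    names = []
    for field in fields[1:]:
        name = OPTIONS.get(field, field)  # accept abbreviation or full name
        if name not in OPTIONS.values():
            raise ValueError("unknown option: " + field)
        names.append(name)
    if "ALL" in names:
        names = list(SECTIONS)
    return int(fields[0]), names
```

Stray blanks around the commas are ignored, so the three requests typed in the transcript above parse to document 16 with TITLE and AUTHOR, document 11 with XTERMS and ABSTRACT, and document 23 with all five parts.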

